GAME PROGRAMMING ON HYBRID CPU-GPU ARCHITECTURES TAKAHIRO HARADA, AMD DESTRUCTION FOR GAMES ERWIN COUMANS, AMD
|
|
- Alexander Harrington
- 5 years ago
- Views:
Transcription
1 GAME PROGRAMMING ON HYBRID CPU-GPU ARCHITECTURES TAKAHIRO HARADA, AMD DESTRUCTION FOR GAMES ERWIN COUMANS, AMD
2 GAME PROGRAMMING ON HYBRID CPU-GPU ARCHITECTURES Jason Yang, Takahiro Harada AMD
3 HYBRID CPU-GPU ARCHITECTURE CPU and GPU are good at different works CPU: serial computation, conditional branch GPU: parallel computation CPU and GPU are: On the same die Much closer Efficient data sharing AMD Fusion Ontario Llano Mobile: A8-3530MX CPU(1.9GHz-2.6GHz) GPU(HD6620G, 5 SIMD engines) Desktop: A8-3870K CPU(3.1GHz) GPU(HD6550D) Able to dispatch works to: Serial work with varying granularity CPU Parallel work with the uniform granularity GPU 3 Physics simulation on fusion architecture June, 2011
4 PHYSICS SIMULATION Particles Cloth Elastic bodies Rigid bodies Fluids 4 Physics simulation on fusion architecture June, 2011
5 PARTICLE BASED SIMULATION ON THE GPU Large number of particles Particles with identical size Work granularity is almost the same Good for the wide SIMD architecture Harada et al Physics simulation on fusion architecture June, 2011
6 GPU ARCHITECTURE AMD Radeon TM HD TFLOPS(S), 638GFLOPS(D), 176GB/sec Many cores 24SIMDs x 64 wide SIMD CPU SSE 4 wide SIMD Program of a work item is packed in VLIW, then executed SIMD SIMD SIMD SIMD AMD Radeon TM HD 6970 SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD AMD Phenom TM II X4 Core Core Core Core SIMD Core 6 Physics simulation on fusion architecture June, 2011
7 PARTICLE BASED SIMULATION f collide = f ij Integration dv dt = f m dx dt = v v += f m t x += v t Acceleration structure is used for efficient collide Uniform grid Suited for the GPU Less divergence 7 Physics simulation on fusion architecture June, 2011
8 DIVERGENCE ON SIMD Void Kernel() { if(a) FuncA(); else if(b) FuncB(); else FuncC(); } 8 Physics simulation on fusion architecture June, 2011
9 PARTICLE BASED SIMULATION ON THE GPU Particle collision using a uniform grid Void Kernel() { prepare(); Cell0 Cell1 Cell2 Cell3 Cell4 Cell5 Cell6 Cell7 Cell8 collide(cell0); collide(cell1); collide(cell2); collide(cell3); collide(cell4); collide(cell5); } collide(cell6); collide(cell7); collide(cell8); 9 Physics simulation on fusion architecture June, 2011
10 MIXED PARTICLE SIMULATION Not only small particles Difficulty for GPUs Large particles interact with small particles Large-large collision 10 Physics simulation on fusion architecture June, 2011
11 CHALLENGE Non uniform work granularity Small-small(SS) collision GPU Large-small(LS) collision CPU Large-large(LL) collision CPU Discrete CPU+GPU Need data transfer 11 Physics simulation on fusion architecture June, 2011
12 FUSION ARCHITECTURE CPU and GPU are: On the same die Much closer Efficient data sharing CPU and GPU are good at different works CPU: serial computation, conditional branch GPU: parallel computation Able to dispatch works to: Serial work with varying granularity CPU Parallel work with the uniform granularity GPU 12 Physics simulation on fusion architecture June, 2011
13 MIXED PARTICLE SIMULATION Benefit from Fusion Architecture Different works in a simulation CPU & GPU are working together Shares data 13 Physics simulation on fusion architecture June, 2011
14 LS TWO SIMULATIONS Small particles Position Velocity Force Grid Build Acc. Structure SS S Integration Large particles Position Velocity Force Build Acc. Structure LL L Integration 14 Physics simulation on fusion architecture June, 2011
15 CLASSIFY BY WORK GRANULARITY Small particles Position Velocity Force Grid Build Acc. Structure Uniform Work SS S Integration Large particles Position Velocity Force Build Acc. Structure Non Uniform Work LL LS L Integration 15 Physics simulation on fusion architecture June, 2011
16 CLASSIFY BY WORK GRANULARITY, ASSIGN PROCESSOR Small particles Position Velocity Force Grid Build Acc. Structure GPU SS S Integration Large particles Position Velocity Force Build Acc. Structure LL CPU LS L Integration 16 Physics simulation on fusion architecture June, 2011
17 DATA SHARING Small particles Position Velocity Force Grid Build Acc. Structure GPU SS S Integration Large particles Position Velocity Grid Force Position Velocity Force Build Acc. Structure LL CPU LS L Integration Grid, small particle data has to be shared with the CPU for LS collision Allocated as zero copy buffer 17 Physics simulation on fusion architecture June, 2011
18 Synchronization Synchronization SYNCHRONIZATION Small particles Position Velocity Force Grid Build Acc. Structure GPU SS S Integration Large particles Position Velocity Grid Force Position Velocity Force Build Acc. Structure LL CPU LS L Integration Grid, small particle data has to be shared with the CPU for LS collision Allocated as zero copy buffer 18 Physics simulation on fusion architecture June, 2011
19 L Integration Synchronization S Integration VISUALIZING WORKLOADS Small particles Position Velocity Force Grid SS GPU Build Acc. Structure Large particles Position Velocity Force LL LS CPU Grid construction can be moved at the end of the pipeline Unbalanced workload 19 Physics simulation on fusion architecture June, 2011
20 L Integration Synchronization S Integration LOAD BALANCING Small particles Position Velocity Force Grid SS GPU Build Acc. Structure Large particles Position Velocity Force LS CPU LL To get better load balancing The sync is for passing the force buffer filled by the CPU to the GPU Move the LL collision after the sync 20 Physics simulation on fusion architecture June, 2011
21 CPU Work GPU Work 21 Physics simulation on fusion architecture June, 2011
22 CPU Work GPU Work 22 Physics simulation on fusion architecture June, 2011
23 UTILIZING 4 CPU CORES 23 Physics simulation on fusion architecture June, 2011
24 Synchronization L Integ. S Integ. FURTHER OPTIMIZATION 1. Not optimized for Llano which is a 4 core CPU Only 2 CPU core were used Can use 2 more cores for LS collision 2. LL collision was not optimized CPU waits when the GPU was constructing a grid Use CPU to improve SS collision SS LS GPU CPU0 CPU1 LL Build Acc. Structure CPU2 24 Physics simulation on fusion architecture June, 2011
25 OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION Cannot split the work by large particle indices More than 1 large particle can collide with a small particle Have to lock the memory on write Inefficient L0 S0 S1 L1 Prepare a local buffer for a thread A buffer storing force on small particles Lock free Local buffers are merged to one Thread0 Thread1 Thread2 25 Physics simulation on fusion architecture June, 2011
26 Synchronization L Integ. S Integ. OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION SS GPU Build Acc. Structure LS CPU0 LL CPU1 CPU2 26 Physics simulation on fusion architecture June, 2011
27 Merge Synchronization Merge Synchronization Merge L Integ. S Integ. OPTIMIZATION1: MULTITHREADING LARGE-SMALL COLLISION SS GPU Build Acc. Structure CPU0 LS LL CPU1 LS CPU2 LS 27 Physics simulation on fusion architecture June, 2011
28 OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION Spatially coherent memory layout improves cache utilization As particles move, spatial locality decreases 28 Physics simulation on fusion architecture June, 2011
29 OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION Spatially coherent memory layout improves cache utilization As particles move, spatial locality decreases 29 Physics simulation on fusion architecture June, 2011
30 SPATIAL SORT Sort particles by spatial location to improve cache utilization Z curve 30 Physics simulation on fusion architecture June, 2011
31 SPATIAL SORT Sort particles by spatial location to improve cache utilization Z curve 31 Physics simulation on fusion architecture June, 2011
32 CHOOSE SORT Requirements Full sort was over the budget Full sort is not a must Sort is an optional computation for performance improvement Incremental sort Use multiple threads Solution Used generalized Odd-even transition sort 32 Physics simulation on fusion architecture June, 2011
33 BLOCK TRANSITION SORT Generalized Odd-even transition sort Instead of sorting 2 adjacent elements, sort adjacent 2 blocks Iterate until convergence Odd-even transition sort Use a thread to sort 2 adjacent blocks 6 blocks for 3 threads Radix sort Block transition sort 33 Physics simulation on fusion architecture June, 2011
34 Merge Synchronization Merge Synchronization Merge L Integ. S Integ. OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION SS GPU Build Acc. Structure CPU0 LS LL CPU1 LS CPU2 LS 34 Physics simulation on fusion architecture June, 2011
35 Merge Synchronization Merge Synchronization Synchronization Merge LL Coll. L Integ. S Integ. OPTIMIZATION2: IMPROVING SMALL-SMALL COLLISION SS GPU Build Acc. Structure CPU0 LS S Sorting CPU1 LS S Sorting CPU2 LS S Sorting 35 Physics simulation on fusion architecture June, 2011
36 DATA DEPENDENCIES L Build Acc. S Build Acc. S Broad phase S Sort Sort Sort LS Collide LS Collide LS Collide M LL Collide M Merge Merge Merge M SS Collide LL Integrate S Integrate 36 Physics simulation on fusion architecture June, 2011
37 CPU Work GPU Work DEMO 37 Physics simulation on fusion architecture June, 2011
38 CPU Work GPU Work DEMO 38 Physics simulation on fusion architecture June, 2011
39 CONCLUSIONS Realized a simulation that handles variable sized particles by leveraging the best features of both the CPU and GPU on AMD s Fusion Architecture The CPU is used for works with non identical compute granularity The GPU is used for highly parallel works Memory sharing between the CPU and GPU is the key for the efficiency Avoid wasteful memory copies 39 Physics simulation on fusion architecture June, 2011
40 REFERENCE Takahiro Harada, Seiichi Koshizuka, Yoichiro Kawaguchi, Smoothed Particle Hydrodynamics on GPUs, Proc. of Computer Graphics International, 63-70(2007) Justin Hensley, Takahiro Harada, Chapter X OpenCL Case Study:Mixed Particle Simulation, Heterogeneous Computing with OpenCL, Morgan Kaufmann(2011) 40 Physics simulation on fusion architecture June, 2011
Bullet Cloth Simulation. Presented by: Justin Hensley Implemented by: Lee Howes
Bullet Cloth Simulation Presented by: Justin Hensley Implemented by: Lee Howes Course Agenda OpenCL Review! OpenCL Development Tips! OpenGL/OpenCL Interoperability! Rigid body particle simulation!
More informationAbout Phoenix FD PLUGIN FOR 3DS MAX AND MAYA. SIMULATING AND RENDERING BOTH LIQUIDS AND FIRE/SMOKE. USED IN MOVIES, GAMES AND COMMERCIALS.
About Phoenix FD PLUGIN FOR 3DS MAX AND MAYA. SIMULATING AND RENDERING BOTH LIQUIDS AND FIRE/SMOKE. USED IN MOVIES, GAMES AND COMMERCIALS. Phoenix FD core SIMULATION & RENDERING. SIMULATION CORE - GRID-BASED
More informationShape of Things to Come: Next-Gen Physics Deep Dive
Shape of Things to Come: Next-Gen Physics Deep Dive Jean Pierre Bordes NVIDIA Corporation Free PhysX on CUDA PhysX by NVIDIA since March 2008 PhysX on CUDA available: August 2008 GPU PhysX in Games Physical
More informationUse cases. Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games
Viewdle Inc. 1 Use cases Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games 2 Why OpenCL matter? OpenCL is going to bring such
More informationChallenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008
Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated
More informationWeb Physics: A Hardware Accelerated Physics Engine for Web- Based Applications
Web Physics: A Hardware Accelerated Physics Engine for Web- Based Applications Tasneem Brutch, Bo Li, Guodong Rong, Yi Shen, Chang Shu Samsung Research America-Silicon Valley {t.brutch, robert.li, g.rong,
More informationWhat is a Rigid Body?
Physics on the GPU What is a Rigid Body? A rigid body is a non-deformable object that is a idealized solid Each rigid body is defined in space by its center of mass To make things simpler we assume the
More informationCUDA Particles. Simon Green
CUDA Particles Simon Green sdkfeedback@nvidia.com Document Change History Version Date Responsible Reason for Change 1.0 Sept 19 2007 Simon Green Initial draft Abstract Particle systems [1] are a commonly
More informationPhysics Parallelization. Erwin Coumans SCEA
Physics Parallelization A B C Erwin Coumans SCEA Takeaway Multi-threading the physics pipeline collision detection and constraint solver Refactoring algorithms and data for SPUs Maintain unified multi-platform
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationFall 2011 PhD Qualifier Exam
Computer Architecture Area Fall 2011 PhD Qualifier Exam November 18 th 2011 This exam has six (6) equally- weighted problems. You must submit your answers to all six problems. Write your answers/solutions
More informationExecution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures
Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture
More informationAMD Fusion APU: Llano. Marcello Dionisio, Roman Fedorov Advanced Computer Architectures
AMD Fusion APU: Llano Marcello Dionisio, Roman Fedorov Advanced Computer Architectures Outline Introduction AMD Llano architecture AMD Llano CPU core AMD Llano GPU Memory access management Turbo core technology
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationAsynchronous OpenCL/MPI numerical simulations of conservation laws
Asynchronous OpenCL/MPI numerical simulations of conservation laws Philippe HELLUY 1,3, Thomas STRUB 2. 1 IRMA, Université de Strasbourg, 2 AxesSim, 3 Inria Tonus, France IWOCL 2015, Stanford Conservation
More informationINTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD
INTRODUCTION TO OPENCL TM A Beginner s Tutorial Udeepta Bordoloi AMD IT S A HETEROGENEOUS WORLD Heterogeneous computing The new normal CPU Many CPU s 2, 4, 8, Very many GPU processing elements 100 s Different
More informationDirected Optimization On Stencil-based Computational Fluid Dynamics Application(s)
Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationviewdle! - machine vision experts
viewdle! - machine vision experts topic using algorithmic metadata creation and heterogeneous computing to build the personal content management system of the future Page 2 Page 3 video of basic recognition
More informationScalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009
Scalable Multi Agent Simulation on the GPU Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Reasoning Explicit State machine, serial Implicit Compute intensive Fits SIMT well Collision avoidance Motivation
More informationParallel Programming for Graphics
Beyond Programmable Shading Course ACM SIGGRAPH 2010 Parallel Programming for Graphics Aaron Lefohn Advanced Rendering Technology (ART) Intel What s In This Talk? Overview of parallel programming models
More informationNumerical Simulation on the GPU
Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle
More informationCUDA Particles. Simon Green
CUDA Particles Simon Green sdkfeedback@nvidia.com Document Change History Version Date Responsible Reason for Change 1.0 Sept 19 2007 Simon Green Initial draft 1.1 Nov 3 2007 Simon Green Fixed some mistakes,
More informationReal-time Graphics 9. GPGPU
9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing GPGPU general-purpose
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationParticle Simulation using CUDA. Simon Green
Particle Simulation using CUDA Simon Green sdkfeedback@nvidia.com July 2012 Document Change History Version Date Responsible Reason for Change 1.0 Sept 19 2007 Simon Green Initial draft 1.1 Nov 3 2007
More informationReal-Time Rendering Architectures
Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationHigh Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters
SIAM PP 2014 High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters C. Riesinger, A. Bakhtiari, M. Schreiber Technische Universität München February 20, 2014
More informationLecture 16. Introduction to Game Development IAP 2007 MIT
6.189 IAP 2007 Lecture 16 Introduction to Game Development Mike Acton, Insomiac Games. 6.189 IAP 2007 MIT Introduction to Game Development (on the Playstation 3 / Cell ) Mike Acton Engine Director, Insomniac
More informationPerformance Optimization Part II: Locality, Communication, and Contention
Lecture 7: Performance Optimization Part II: Locality, Communication, and Contention Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Beth Rowley Nobody s Fault but Mine
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 11
More informationLATTICE-BOLTZMANN AND COMPUTATIONAL FLUID DYNAMICS
LATTICE-BOLTZMANN AND COMPUTATIONAL FLUID DYNAMICS NAVIER-STOKES EQUATIONS u t + u u + 1 ρ p = Ԧg + ν u u=0 WHAT IS COMPUTATIONAL FLUID DYNAMICS? Branch of Fluid Dynamics which uses computer power to approximate
More informationHow to Optimize Geometric Multigrid Methods on GPUs
How to Optimize Geometric Multigrid Methods on GPUs Markus Stürmer, Harald Köstler, Ulrich Rüde System Simulation Group University Erlangen March 31st 2011 at Copper Schedule motivation imaging in gradient
More informationHybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS
+ Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics
More informationUberFlow: A GPU-Based Particle Engine
UberFlow: A GPU-Based Particle Engine Peter Kipfer Mark Segal Rüdiger Westermann Technische Universität München ATI Research Technische Universität München Motivation Want to create, modify and render
More informationGPU ARCHITECTURE Chris Schultz, June 2017
GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationDiscrete Element Method
Discrete Element Method Midterm Project: Option 2 1 Motivation NASA, loyalkng.com, cowboymining.com Industries: Mining Food & Pharmaceutics Film & Game etc. Problem examples: Collapsing Silos Mars Rover
More informationAntonio R. Miele Marco D. Santambrogio
Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Introduction First
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationNVIDIA. Interacting with Particle Simulation in Maya using CUDA & Maximus. Wil Braithwaite NVIDIA Applied Engineering Digital Film
NVIDIA Interacting with Particle Simulation in Maya using CUDA & Maximus Wil Braithwaite NVIDIA Applied Engineering Digital Film Some particle milestones FX Rendering Physics 1982 - First CG particle FX
More informationReal-time Graphics 9. GPGPU
Real-time Graphics 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing
More informationad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors
ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng
More informationCUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata
CUDA Fluid simulation Lattice Boltzmann Models Cellular Automata Please excuse my layout of slides for the remaining part of the talk! Fluid Simulation Navier Stokes equations for incompressible fluids
More informationHandout 3. HSAIL and A SIMT GPU Simulator
Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants
More informationPraktikum 2014 Parallele Programmierung Universität Hamburg Dept. Informatics / Scientific Computing. October 23, FluidSim.
Praktikum 2014 Parallele Programmierung Universität Hamburg Dept. Informatics / Scientific Computing October 23, 2014 Paul Bienkowski Author 2bienkow@informatik.uni-hamburg.de Dr. Julian Kunkel Supervisor
More informationROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 16 MARKS CS 2354 ADVANCE COMPUTER ARCHITECTURE 1. Explain the concepts and challenges of Instruction-Level Parallelism. Define
More informationUsing GPUs to compute the multilevel summation of electrostatic forces
Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of
More informationChapter 6 Solutions S-3
6 Solutions Chapter 6 Solutions S-3 6.1 There is no single right answer for this question. The purpose is to get students to think about parallelism present in their daily lives. The answer should have
More informationParallel Programming
Parallel Programming OpenMP Nils Moschüring PhD Student (LMU) Nils Moschüring PhD Student (LMU), OpenMP 1 1 Overview What is parallel software development Why do we need parallel computation? Problems
More informationHybrid Implementation of 3D Kirchhoff Migration
Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationEfficient Tridiagonal Solvers for ADI methods and Fluid Simulation
Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular
More informationAnatomy of AMD s TeraScale Graphics Engine
Anatomy of AMD s TeraScale Graphics Engine Mike Houston Design Goals Focus on Efficiency f(perf/watt, Perf/$) Scale up processing power and AA performance Target >2x previous generation Enhance stream
More information7 Solutions. Solution 7.1. Solution 7.2
7 Solutions Solution 7.1 There is no single right answer for this question. The purpose is to get students to think about parallelism present in their daily lives. The answer should have at least 10 activities
More informationGPU ARCHITECTURE Chris Schultz, June 2017
Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE Problems Solved Over Time versus Why are they different? Complex
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationProfiling and Debugging OpenCL Applications with ARM Development Tools. October 2014
Profiling and Debugging OpenCL Applications with ARM Development Tools October 2014 1 Agenda 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationGPGPU COMPUTE ON AMD. Udeepta Bordoloi April 6, 2011
GPGPU COMPUTE ON AMD Udeepta Bordoloi April 6, 2011 WHY USE GPU COMPUTE CPU: scalar processing + Latency + Optimized for sequential and branching algorithms + Runs existing applications very well - Throughput
More informationAdministration. Prerequisites. CS 395T: Topics in Multicore Programming. Why study parallel programming? Instructors: TA:
CS 395T: Topics in Multicore Programming Administration Instructors: Keshav Pingali (CS,ICES) 4.126A ACES Email: pingali@cs.utexas.edu TA: Aditya Rawal Email: 83.aditya.rawal@gmail.com University of Texas,
More informationHeterogeneous Architecture. Luca Benini
Heterogeneous Architecture Luca Benini lbenini@iis.ee.ethz.ch Intel s Broadwell 03.05.2016 2 Qualcomm s Snapdragon 810 03.05.2016 3 AMD Bristol Ridge Departement Informationstechnologie und Elektrotechnik
More informationADVANCED RENDERING EFFECTS USING OPENCL TM AND APU Session Olivier Zegdoun AMD Sr. Software Engineer
ADVANCED RENDERING EFFECTS USING OPENCL TM AND APU Session 2117 Olivier Zegdoun AMD Sr. Software Engineer CONTENTS Rendering Effects Before Fusion: single discrete GPU case Before Fusion: multiple discrete
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The
More informationInternational Supercomputing Conference 2009
International Supercomputing Conference 2009 Implementation of a Lattice-Boltzmann-Method for Numerical Fluid Mechanics Using the nvidia CUDA Technology E. Riegel, T. Indinger, N.A. Adams Technische Universität
More informationIntroducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method
Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method G. Wellein, T. Zeiser, G. Hager HPC Services Regional Computing Center A. Nitsure, K. Iglberger, U. Rüde Chair for System
More informationSoftware and Performance Engineering for numerical codes on GPU clusters
Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3
More informationProgrammer's View of Execution Teminology Summary
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 28: GP-GPU Programming GPUs Hardware specialized for graphics calculations Originally developed to facilitate the use of CAD programs
More informationGPU A rchitectures Architectures Patrick Neill May
GPU Architectures Patrick Neill May 30, 2014 Outline CPU versus GPU CUDA GPU Why are they different? Terminology Kepler/Maxwell Graphics Tiled deferred rendering Opportunities What skills you should know
More informationIntroduction to Parallel Programming For Real-Time Graphics (CPU + GPU)
Introduction to Parallel Programming For Real-Time Graphics (CPU + GPU) Aaron Lefohn, Intel / University of Washington Mike Houston, AMD / Stanford 1 What s In This Talk? Overview of parallel programming
More informationParticle-based Fluid Simulation
Simulation in Computer Graphics Particle-based Fluid Simulation Matthias Teschner Computer Science Department University of Freiburg Application (with Pixar) 10 million fluid + 4 million rigid particles,
More informationAMD Elite A-Series APU Desktop LAUNCHING JUNE 4 TH PLACE YOUR ORDERS TODAY!
AMD Elite A-Series APU Desktop LAUNCHING JUNE 4 TH PLACE YOUR ORDERS TODAY! INTRODUCING THE APU: DIFFERENT THAN CPUS A new AMD led category of processor APUs Are Their OWN Category + = Up to 779 GFLOPS
More informationAlexei Katranov. IWOCL '16, April 21, 2016, Vienna, Austria
Alexei Katranov IWOCL '16, April 21, 2016, Vienna, Austria Hardware: customization, integration, heterogeneity Intel Processor Graphics CPU CPU CPU CPU Multicore CPU + integrated units for graphics, media
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationComputer Architecture 计算机体系结构. Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU. Chao Li, PhD. 李超博士
Computer Architecture 计算机体系结构 Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2017 Review Thread, Multithreading, SMT CMP and multicore Benefits of
More informationProgramming Models for Multi- Threading. Brian Marshall, Advanced Research Computing
Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationCUB. collective software primitives. Duane Merrill. NVIDIA Research
CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives
More informationGPGPU Lessons Learned. Mark Harris
GPGPU Lessons Learned Mark Harris General-Purpose Computation on GPUs Highly parallel applications Physically-based simulation image processing scientific computing computer vision computational finance
More informationMulti-Processors and GPU
Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock
More informationIntroduction to Parallel Programming Models
Introduction to Parallel Programming Models Tim Foley Stanford University Beyond Programmable Shading 1 Overview Introduce three kinds of parallelism Used in visual computing Targeting throughput architectures
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationSCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL
SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL Matthias Bach and David Rohr Frankfurt Institute for Advanced Studies Goethe University of Frankfurt I: INTRODUCTION 3 Scaling
More informationMasterpraktikum Scientific Computing
Masterpraktikum Scientific Computing High-Performance Computing Michael Bader Alexander Heinecke Technische Universität München, Germany Outline Logins Levels of Parallelism Single Processor Systems Von-Neumann-Principle
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationVisual Analysis of Lagrangian Particle Data from Combustion Simulations
Visual Analysis of Lagrangian Particle Data from Combustion Simulations Hongfeng Yu Sandia National Laboratories, CA Ultrascale Visualization Workshop, SC11 Nov 13 2011, Seattle, WA Joint work with Jishang
More informationPerformance potential for simulating spin models on GPU
Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational
More informationFrom Biological Cells to Populations of Individuals: Complex Systems Simulations with CUDA (S5133)
From Biological Cells to Populations of Individuals: Complex Systems Simulations with CUDA (S5133) Dr Paul Richmond Research Fellow University of Sheffield (NVIDIA CUDA Research Centre) Overview Complex
More informationHigher Level Programming Abstractions for FPGAs using OpenCL
Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*
More informationUltra Large-Scale FFT Processing on Graphics Processor Arrays. Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc.
Abstract Ultra Large-Scale FFT Processing on Graphics Processor Arrays Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc. Graphics Processor Unit (GPU) technology has been shown well-suited to efficient
More informationGraphics Hardware 2008
AMD Smarter Choice Graphics Hardware 2008 Mike Mantor AMD Fellow Architect michael.mantor@amd.com GPUs vs. Multi-core CPUs On a Converging Course or Fundamentally Different? Many Cores Disruptive Change
More informationPorting a parallel rotor wake simulation to GPGPU accelerators using OpenACC
DLR.de Chart 1 Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC Melven Röhrig-Zöllner DLR, Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU)
More informationHeterogeneous SoCs. May 28, 2014 COMPUTER SYSTEM COLLOQUIUM 1
COSCOⅣ Heterogeneous SoCs M5171111 HASEGAWA TORU M5171112 IDONUMA TOSHIICHI May 28, 2014 COMPUTER SYSTEM COLLOQUIUM 1 Contents Background Heterogeneous technology May 28, 2014 COMPUTER SYSTEM COLLOQUIUM
More informationHeterogeneous-Race-Free Memory Models
Heterogeneous-Race-Free Memory Models Jyh-Jing (JJ) Hwang, Yiren (Max) Lu 02/28/2017 1 Outline 1. Background 2. HRF-direct 3. HRF-indirect 4. Experiments 2 Data Race Condition op1 op2 write read 3 Sequential
More informationDeveloping PIC Codes for the Next Generation Supercomputer using GPUs. Viktor K. Decyk UCLA
Developing PIC Codes for the Next Generation Supercomputer using GPUs Viktor K. Decyk UCLA Abstract The current generation of supercomputer (petaflops scale) cannot be scaled up to exaflops (1000 petaflops),
More information