2 Malloc in OpenCL kernels: Why and how?
Roy Spliet, BSc. (MSc. student, Delft University of Technology)
Dr. A.L. Varbanescu, Prof. Dr. Ir. H.J. Sips (Delft University of Technology)
Dr. B.R. Gaster, Dr. L.W. Howes (Advanced Micro Devices)

3 Why? Environment
4 Why? Environment
- Thousands of work-items
  - Maintaining a consistent heap state is difficult
- Single Instruction, Multiple Threads
  - Avoid divergent branching
  - Assume work-items are at the exact same instruction of the program
- Goal is problem solving, not communication
  - Speed!
- Hardware limitations
Malloc in OpenCL Kernels: why and how? June 12th, 2012

5 Why? Environment
- OpenCL kernels cannot request memory from host
  - Memory managed in device driver
  - Only option: allocate in advance
  - Round trip to host would be expensive
- What happens on the device, stays on the device
  - Pointers not valid on host system
  - Context gets lost on transfer
  - (Pre-allocated space: work with offsets instead)
- Speed
  - Allocating on device means allocating on host, then on device
  - Allocating on host is faster

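The remark about working with offsets can be illustrated with a small host-side C sketch. It is not taken from the slides: `node_t`, `POOL_NODES`, `NIL`, `build_example`, `copy_pool` and `list_sum` are hypothetical names, and a plain `memcpy` stands in for a device-to-host buffer transfer. Because nodes link to each other by index into the pre-allocated pool rather than by address, the list is still valid after the copy.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POOL_NODES 8
#define NIL UINT32_MAX   /* hypothetical "null" offset */

/* A list node that links by index into the pool, not by address. */
typedef struct {
    int32_t  value;
    uint32_t next;       /* index of the next node in the pool, or NIL */
} node_t;

/* Build the list 1 -> 2 -> 3 in scattered pool slots; return the head index. */
static uint32_t build_example(node_t *pool)
{
    pool[0] = (node_t){ .value = 1, .next = 2 };
    pool[2] = (node_t){ .value = 2, .next = 5 };
    pool[5] = (node_t){ .value = 3, .next = NIL };
    return 0;
}

/* Stand-in for a device-to-host buffer read: a plain byte copy. */
static void copy_pool(node_t *dst, const node_t *src)
{
    memcpy(dst, src, POOL_NODES * sizeof *src);
}

/* Sum the list stored in `pool`, starting at index `head`. */
static int32_t list_sum(const node_t *pool, uint32_t head)
{
    int32_t sum = 0;
    for (uint32_t i = head; i != NIL; i = pool[i].next) {
        assert(i < POOL_NODES);
        sum += pool[i].value;
    }
    return sum;
}
```

Had `next` been a raw pointer, the copied list would still point into the original buffer, which is the "lost context" problem the slide describes.
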
6 Why? Environment
- Hardware limitations
  - No GPU->Host communication: no solution for memory overestimation
  - Lost context: only use for temporary variables
  - Overhead: avoid using malloc() in-kernel when possible

7 Why? Research
8 Why? Research: Set-up
- Use-case study: find and investigate a diverse set of parallel programs and categorise their usage of:
  - malloc()
    - Call frequency
    - Object size
    - Number of objects
  - free()
    - Call frequency
  - Allocated memory chunks
    - Global/local (read) access
    - Access pattern

10 Why? Research: Set-up
Algorithm class         | Program           | Source                       | Library
Finite State Machine    | Level-7 filtering | Case, Hellas University      |
Combinatorial           |                   |                              |
Graph Traversal         | Graph Analysis    | Code, TU Delft               | OpenCL
Structured Grid         | Heart Wall        | Code, Rodinia                | OpenMP
Dense Linear Algebra    | K-Means           | Code, Rodinia                | OpenMP
Sparse Matrix           | SPMV              | Code, Parboil                | CUDA
Spectral (FFT)          | FFT               | Code, Parboil                | CUDA
Dynamic Programming     | Dijkstra          | Theory                       |
N-Body/Particle Methods | Barnes-Hut        | Code, Texas State University | OpenCL
MapReduce               |                   |                              |
Backtracking            |                   |                              |
Unstructured Grid       | Back-propagation  | Code, Rodinia                | OpenMP
1. Krste Asanovic et al. A view of the parallel computing landscape. Commun. ACM, 52:56-67, October 2009.

11 Why? Research: Results
- Finite State Machine, Dynamic programming
  - Arbitrary sized worklist - scheduling
- Graph Traversal
  - Worklist scheduling
  - Graph
- N-Body simulation
  - Octree scheduling

13 Why? Research: Results
- Process input character for single NFA in parallel
  - Non-deterministic Finite Automaton
  - Next state not deterministic
- Input "A": 4 states
- Dynamic sized work queue: use 4 threads for next input word in parallel
- Specific case of graph traversal (just like dynamic programming)
- Unknown task list size: overcompensate or malloc()

14 Why? Research: Results
- Properties of use-cases:
  - Memory is allocated possibly many times, but freed once at the end
  - Allocated memory is always an array of equally sized objects
  - Each work-item accesses its own memory linearly or with fixed intervals
    - On the GPU's memory bus this corresponds to random access
- But try a generic design
  - Object allocation in C++ kernels

15 Why? Research: Results
- Other programs
  - Memory chunk size determined on host
    - Proportional to input
    - Proportional to number of threads
    - Data uploaded or downloaded
  - Or no global storage at all
    - Local variables a lot faster

16 Why?
17 Why? It's not trivial
- Identified use-cases: demand for more versatile memory management
  - Maintainability of OpenCL kernels
  - Shorter development time
  - Heap and list management optimised together
- Learn more about users' needs
  - Determine if limitations should be eliminated based on user feedback
- We write code for memory management, so you don't have to

18 How? Design proposal
19 How? Design proposal: Requirements
- Use-case study
  - Lists of equal-sized objects
  - Allocate per-iteration
  - Free entire lists when done
  - Global access
  - But generic design
- Platform
  - Thread-safe
  - Fast
  - Improve locality, fill up memory bus

20 How? Design proposal
- ArrayList
  - AddXXX(): improve performance by calling heap manager as little as possible
    - Prefix-sum reduction to gather memory requirements: O(log p) [2]
  - List of equal-sized objects
  - Item(): global access
  - Clear(): free all at once
- Heap manager
  - Optimised for relatively large chunks
  - Traditional use, no limitations
2. Xiaohuang Huang et al. Xmalloc: A scalable lock-free dynamic memory allocator for many-core machines. In Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on, July 2010.

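The prefix-sum step named for AddXXX() can be sketched as follows. This is a serial, host-side C emulation under stated assumptions, not the presenter's implementation: `gather_requests` and `MAX_ITEMS` are hypothetical names, and each inner loop stands for one pass that a work-group could execute for all work-items in parallel, giving about log2(p) passes for p work-items. Each work-item states how many objects it wants; the scan yields a private offset per work-item plus the total, so the heap manager only has to be called once for the whole group.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ITEMS 64   /* hypothetical upper bound on work-items per group */

/*
 * Serial emulation of a Hillis-Steele inclusive scan.
 * On return, offset[i] holds the exclusive prefix sum request[0] + ... +
 * request[i-1], i.e. where work-item i may start writing inside one shared
 * allocation. The return value is the total number of objects requested.
 */
static size_t gather_requests(const size_t *request, size_t *offset, size_t p)
{
    size_t a[MAX_ITEMS], b[MAX_ITEMS];

    assert(p > 0 && p <= MAX_ITEMS);
    memcpy(a, request, p * sizeof *a);

    /* The stride d doubles every pass: 1, 2, 4, ... */
    for (size_t d = 1; d < p; d *= 2) {
        for (size_t i = 0; i < p; i++)          /* parallel on a device */
            b[i] = (i >= d) ? a[i] + a[i - d] : a[i];
        memcpy(a, b, p * sizeof *a);
    }

    for (size_t i = 0; i < p; i++)
        offset[i] = a[i] - request[i];          /* inclusive -> exclusive */
    return a[p - 1];
}
```

For requests {2, 0, 3, 1} the offsets come out as {0, 2, 2, 5} with a total of 6, so a single 6-object allocation serves all four work-items.
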
21 How? Challenges
22 How? Challenges: Heap management algorithm
- Platform
  - Thread-safe
  - Scalable
  - Fast
- Efficient memory allocation
  - Low fragmentation
  - High performance
- Use-cases
  - Generic
  - Efficient for use with ArrayList objects, e.g. medium-sized memory chunks

23 How? Challenges: Heap management algorithm
- Hoard [3]
  - Local heap for each processor (or more) to fight concurrent access
  - Small objects: (multi-page) superblocks with equally-sized objects
  - Large objects: directly served pages
- Not efficient with thousands of threads
  - Low utilisation of malloc(), relatively small objects per thread: superblocks will not fill up
  - Administration per thread larger than memory demand per thread
3. Emery D. Berger et al. Hoard: a scalable memory allocator for multithreaded applications. SIGPLAN Not., 35:117-128, November 2000.

24 How? Challenges: Heap management algorithm
- DLMalloc [4]
  - One heap
  - Small objects: pages of equally sized blocks
  - Medium objects: best fit from available memory
  - Large objects: serve directly from OS
  - Free memory blocks linked together in a doubly-linked list, categorised in size buckets, ordered by size
- Serving directly from OS not possible
- Best-fit (with coalescing) gives initial problems
4. D. Lea. A memory allocator, October.

25 How? Challenges: Heap management algorithm
- Best-fit
  - Blocks of arbitrary sizes, possibly aligned
  - Free blocks linked together, categorised in size buckets
  - Free blocks as large as possible: coalescing adjacent free blocks
  - Split off desired chunk on allocation

27 How? Challenges: Heap management algorithm
- DLMalloc: best-fit (with coalescing) gives initial problems
  - One large block available, many threads requiring a small chunk
  - One at a time: take entire block, split own chunk off, free the rest for next thread

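A minimal, single-threaded C sketch of the best-fit-and-split step described above. It makes simplifying assumptions that are not in the slides: the free list is a small unsorted array instead of bucketed doubly-linked lists, there is no coalescing and no thread safety, and `heap_t`, `block_t`, `best_fit_alloc`, `MAX_FREE` and `NO_FIT` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FREE 16
#define NO_FIT ((size_t)-1)   /* hypothetical "allocation failed" marker */

typedef struct { size_t offset, size; } block_t;

typedef struct {
    block_t free_blocks[MAX_FREE];   /* unsorted list of free blocks */
    size_t  count;
} heap_t;

/*
 * Best fit: choose the smallest free block that is large enough, split the
 * requested chunk off its front, and leave the remainder on the free list.
 * Returns the chunk's offset into the heap, or NO_FIT.
 */
static size_t best_fit_alloc(heap_t *h, size_t size)
{
    size_t best = NO_FIT;

    assert(size > 0);
    for (size_t i = 0; i < h->count; i++) {
        size_t s = h->free_blocks[i].size;
        if (s >= size && (best == NO_FIT || s < h->free_blocks[best].size))
            best = i;
    }
    if (best == NO_FIT)
        return NO_FIT;

    size_t offset = h->free_blocks[best].offset;
    if (h->free_blocks[best].size == size) {
        /* Exact fit: drop the block by moving the last entry into its slot. */
        h->free_blocks[best] = h->free_blocks[--h->count];
    } else {
        /* Split: the remainder stays free for the next request. */
        h->free_blocks[best].offset += size;
        h->free_blocks[best].size   -= size;
    }
    return offset;
}
```

Starting from one 1024-byte free block, two 100-byte requests return offsets 0 and 100 and leave an 824-byte remainder. With many work-items, every request has to pass through that single block in turn, which is the serialisation the slide points out.
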
28 How? Challenges: Locking or Lock-Free
- atom_cmpxchg: atomically compare, exchange if equal
- atom_cmpxchg(p, a, b) behaves like the following, executed as one atomic step:
    old = *p;
    if (old == a)
        *p = b;
    return old;

29 How? Challenges: Locking or Lock-Free
- Locking:
  - Does not scale well
    - Performance linear with the number of threads
  - Is straightforward to implement
- Lock-free:
  - Scales better on multi-core CPUs
    - But not necessarily on GPUs
    - CMPXCHG instruction fails a lot when executed in SIMD-like work-groups
  - And is complex to implement

30 How? Challenges: Locking
- Spin-lock:
    do {
        lockval = atom_cmpxchg(&lock, 0, 1);
    } while (lockval != 0);
    /* Critical section */
    atom_xchg(&lock, 0);

32 How? Challenges: Locking
- Spin-lock:
    while (true) {
        if (atom_cmpxchg(&lock, 0, 1) == 0) {
            /* critical section */
            atom_xchg(&lock, 0);
            break;
        }
    }

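The second spin-lock form can be tried on a host with C11 atomics. The sketch below is an emulation under assumptions, not OpenCL code: `cmpxchg_emul`, `locked_increment`, `lock` and `shared_counter` are hypothetical names, and `cmpxchg_emul` mimics the return-the-old-value behaviour of `atom_cmpxchg` using `atomic_compare_exchange_strong`. The transcript does not say why two forms are shown. A commonly cited reason is that on SIMT hardware a loop that spins before entering the critical section can hang when the lock holder and the waiters run in lock-step in the same work-group, which keeping the critical section inside the winning branch avoids.

```c
#include <assert.h>
#include <stdatomic.h>

/* Host-side stand-in for OpenCL's atom_cmpxchg(): returns the old value. */
static int cmpxchg_emul(atomic_int *p, int expected, int desired)
{
    int old = expected;
    /* On failure, C11 stores the value it actually found into `old`;
     * on success `old` already equals the previous value. */
    atomic_compare_exchange_strong(p, &old, desired);
    return old;
}

static atomic_int lock;       /* 0 = free, 1 = held (zero-initialised) */
static int shared_counter;    /* protected by `lock` */

/* Second form from the slides: the critical section and the release sit
 * inside the branch taken by whoever wins the compare-exchange. */
static void locked_increment(void)
{
    for (;;) {
        if (cmpxchg_emul(&lock, 0, 1) == 0) {
            shared_counter++;              /* critical section */
            atomic_exchange(&lock, 0);     /* release the lock */
            break;
        }
    }
}
```

Calling `locked_increment()` from several threads would serialise the increments; this emulation says nothing about how a particular GPU schedules divergent work-items.
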
34 How? Challenges: Lock-Free
- Lock-free doubly-linked list algorithm
  - Unlink: O(p * log p), with p the number of processors
    - Only if all to-be-freed blocks are adjacent and scheduling is least efficient
  - Link: O(n), with n the number of free blocks
    - Repeat when failed, highly unlikely

35 How? Challenges: Lock-Free
- Unlink
  - Mark deleted
  - While there is a next to-be-deleted node pointing at you
    - Make it point at your previous neighbour instead (cmpxchg)
  - While there is no next to-be-deleted node pointing at you
    - If your previous node is not to be deleted, cmpxchg it to the right neighbour instead
    - If that succeeds, the block is no longer being linked to: unlinked!

39 How? Challenges: Lock-Free
- Link
  - Find the right position, just before a regular (not to be deleted) node
  - Copy previous and next
  - Alter the previous node's next-pointer to your node (cmpxchg)
  - If that succeeded, alter the next node's previous-pointer as well and mark yourself free

41 How? Challenges: ArrayList
- Straightforward
  - Concurrency guaranteed by flexible memory allocator
  - Singly-linked list of memory blocks
  - Synchronised allocation by prefix-sum

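As an illustration of a list kept as a singly-linked chain of memory blocks, here is a small host-side C sketch. It is an assumption-laden stand-in rather than the presenter's ArrayList: the standard `malloc`/`free` replace the in-kernel heap manager, objects are plain `int`s, and `array_list_t`, `chunk_t`, `list_add`, `list_item` and `list_clear` are hypothetical names loosely mirroring AddXXX(), Item() and Clear().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One memory block of the list: a link plus `count` equally sized objects. */
typedef struct chunk {
    struct chunk *next;
    size_t        count;
    int           items[];   /* flexible array member */
} chunk_t;

typedef struct {
    chunk_t *head, *tail;
    size_t   size;           /* total number of objects in the list */
} array_list_t;

/* Append room for n objects with a single heap call; returns the first new
 * slot, or NULL if the allocation failed. */
static int *list_add(array_list_t *l, size_t n)
{
    assert(l != NULL);
    chunk_t *c = malloc(sizeof *c + n * sizeof c->items[0]);
    if (c == NULL)
        return NULL;
    c->next = NULL;
    c->count = n;
    if (l->tail != NULL)
        l->tail->next = c;
    else
        l->head = c;
    l->tail = c;
    l->size += n;
    return c->items;
}

/* Global access: walk the blocks until the index falls inside one. */
static int *list_item(const array_list_t *l, size_t index)
{
    for (chunk_t *c = l->head; c != NULL; c = c->next) {
        if (index < c->count)
            return &c->items[index];
        index -= c->count;
    }
    return NULL;             /* index out of range */
}

/* Free every block at once and reset the list. */
static void list_clear(array_list_t *l)
{
    chunk_t *c = l->head;
    while (c != NULL) {
        chunk_t *next = c->next;
        free(c);
        c = next;
    }
    l->head = l->tail = NULL;
    l->size = 0;
}
```

In the design on the slides, the size passed to one add call would be the prefix-sum total for a whole work-group, so that the group costs one heap request per iteration.
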
42 How?
43 Conclusion
- Identified use-cases: demand for versatile memory management
  - Improving maintainability of OpenCL kernels
  - Improving development time
  - Optimise together instead of trying to re-invent the wheel
- Proposal:
  - Semi-traditional memory allocator
    - DLMalloc without OS interaction (fixed heap size)
    - Lock-free implementation
  - ArrayLists to optimise use of it

44 Achievements
- Proof of concept
  - Sort-of working implementation of lock-free malloc
    - Lock-free DLL algorithm not valid
    - Perhaps resort to a different back-end (SLL based?) and take the performance penalty
    - Or investigate locking possibilities when extending OpenCL
  - Working implementation of ArrayList using malloc
    - Required 6-8 hours of development time

45 Future work
- Leverage prototype to library (requires OpenCL 1.2 linking capability)
- Research acceptance and shortcomings based on user response
- Impact of HSA on current hardware limitations
- Optimise algorithms
- For me:
  - Find implementable thread-safe heap manager
  - Finish prototype
  - Benchmark: measure scalability
  - Measure impact of (local/global) prefix-sum array-lists

46 Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.
SIMULATOR AMD RESEARCH JUNE 14, 2015
AMD'S gem5apu SIMULATOR AMD RESEARCH JUNE 14, 2015 OVERVIEW Introducing AMD s gem5 APU Simulator Extends gem5 with a GPU timing model Supports Heterogeneous System Architecture in SE mode Includes several
More informationOPENCL TM APPLICATION ANALYSIS AND OPTIMIZATION MADE EASY WITH AMD APP PROFILER AND KERNELANALYZER
OPENCL TM APPLICATION ANALYSIS AND OPTIMIZATION MADE EASY WITH AMD APP PROFILER AND KERNELANALYZER Budirijanto Purnomo AMD Technical Lead, GPU Compute Tools PRESENTATION OVERVIEW Motivation AMD APP Profiler
More informationINTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD
INTRODUCTION TO OPENCL TM A Beginner s Tutorial Udeepta Bordoloi AMD IT S A HETEROGENEOUS WORLD Heterogeneous computing The new normal CPU Many CPU s 2, 4, 8, Very many GPU processing elements 100 s Different
More informationAutomatic Intra-Application Load Balancing for Heterogeneous Systems
Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena
More informationUse cases. Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games
Viewdle Inc. 1 Use cases Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games 2 Why OpenCL matter? OpenCL is going to bring such
More informationEXPLOITING ACCELERATOR-BASED HPC FOR ARMY APPLICATIONS
EXPLOITING ACCELERATOR-BASED HPC FOR ARMY APPLICATIONS James Ross High Performance Technologies, Inc (HPTi) Computational Scientist Edward Carmack David Richie Song Park, Brian Henz and Dale Shires HPTi
More informationHIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based. Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager
HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager INTRODUCTION WHO WE ARE 3 Highly Parallel Computing in Physics-based
More informationEFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT
EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT JOSEPH L. GREATHOUSE, MAYANK DAGA AMD RESEARCH 11/20/2014 THIS TALK IN ONE SLIDE Demonstrate how to save space and time
More informationAMD IOMMU VERSION 2 How KVM will use it. Jörg Rödel August 16th, 2011
AMD IOMMU VERSION 2 How KVM will use it Jörg Rödel August 16th, 2011 AMD IOMMU VERSION 2 WHAT S NEW? 2 AMD IOMMU Version 2 Support in KVM August 16th, 2011 Public NEW FEATURES - OVERVIEW Two-level page
More informationCAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to the features, functionality, availability, timing,
More informationACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research
ACCELERATING MATRIX PROCESSING WITH GPUs Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUS MOTIVATION Matrix operations
More informationPanel Discussion: The Future of I/O From a CPU Architecture Perspective
Panel Discussion: The Future of I/O From a CPU Architecture Perspective Brad Benton AMD, Inc. #OFADevWorkshop Issues Move to Exascale involves more parallel processing across more processing elements GPUs,
More informationviewdle! - machine vision experts
viewdle! - machine vision experts topic using algorithmic metadata creation and heterogeneous computing to build the personal content management system of the future Page 2 Page 3 video of basic recognition
More informationAMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016
AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING BILL.BRANTLEY@AMD.COM, FELLOW 3 OCTOBER 2016 AMD S VISION FOR EXASCALE COMPUTING EMBRACING HETEROGENEITY CHAMPIONING OPEN SOLUTIONS ENABLING LEADERSHIP
More informationMIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011
MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise June 2011 FREE LUNCH IS OVER, CODES HAVE TO MIGRATE! Many existing legacy codes needs to migrate to
More informationUnderstanding GPGPU Vector Register File Usage
Understanding GPGPU Vector Register File Usage Mark Wyse AMD Research, Advanced Micro Devices, Inc. Paul G. Allen School of Computer Science & Engineering, University of Washington AGENDA GPU Architecture
More informationMulti-core processors are here, but how do you resolve data bottlenecks in native code?
Multi-core processors are here, but how do you resolve data bottlenecks in native code? hint: it s all about locality Michael Wall October, 2008 part I of II: System memory 2 PDC 2008 October 2008 Session
More informationADVANCED RENDERING EFFECTS USING OPENCL TM AND APU Session Olivier Zegdoun AMD Sr. Software Engineer
ADVANCED RENDERING EFFECTS USING OPENCL TM AND APU Session 2117 Olivier Zegdoun AMD Sr. Software Engineer CONTENTS Rendering Effects Before Fusion: single discrete GPU case Before Fusion: multiple discrete
More informationBIOMEDICAL DATA ANALYSIS ON HETEROGENEOUS PLATFORM. Dong Ping Zhang Heterogeneous System Architecture AMD
BIOMEDICAL DATA ANALYSIS ON HETEROGENEOUS PLATFORM Dong Ping Zhang Heterogeneous System Architecture AMD VASCULATURE ENHANCEMENT 3 Biomedical data analysis on heterogeneous platform June, 2012 EXAMPLE:
More informationDesigning Natural Interfaces
Designing Natural Interfaces So what? Computers are everywhere C.T.D.L.L.C. Computers that don t look like computers. Computers that don t look like Computers Computers that don t look like Computers
More informationFUSION PROCESSORS AND HPC
FUSION PROCESSORS AND HPC Chuck Moore AMD Corporate Fellow & Technology Group CTO June 14, 2011 Fusion Processors and HPC Today: Multi-socket x86 CMPs + optional dgpu + high BW memory Fusion APUs (SPFP)
More informationAccelerating Applications. the art of maximum performance computing James Spooner Maxeler VP of Acceleration
Accelerating Applications the art of maximum performance computing James Spooner Maxeler VP of Acceleration Introduction The Process The Tools Case Studies Summary What do we mean by acceleration? How
More informationAMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES
AMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES VERSION 1 - FEBRUARY 2018 CONTACT Address Advanced Micro Devices, Inc 7171 Southwest Pkwy Austin, Texas 78735 United States Phone
More informationAMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling
AMD Radeon Open Compute Platform Felix Kuehling ROCM PLATFORM ON LINUX Compiler Front End AMDGPU Driver Enabled with ROCm GCN Assembly Device LLVM Compiler (GCN) LLVM Opt Passes GCN Target Host LLVM Compiler
More informationINTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS
INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS ARKAPRAVA BASU, JOSEPH L. GREATHOUSE, GURU VENKATARAMANI, JÁN VESELÝ AMD RESEARCH, ADVANCED MICRO DEVICES, INC. MODERN SYSTEMS ARE POWERED BY HETEROGENEITY
More informationOpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data
OpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data Andrew Miller Computer Vision Group Research Developer 3-D TERRAIN RECONSTRUCTION
More informationSCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL
SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL Matthias Bach and David Rohr Frankfurt Institute for Advanced Studies Goethe University of Frankfurt I: INTRODUCTION 3 Scaling
More informationRun Anywhere. The Hardware Platform Perspective. Ben Pollan, AMD Java Labs October 28, 2008
Run Anywhere The Hardware Platform Perspective Ben Pollan, AMD Java Labs October 28, 2008 Agenda Java Labs Introduction Community Collaboration Performance Optimization Recommendations Leveraging the Latest
More information1 HiPEAC January, 2012 Public TASKS, FUTURES AND ASYNCHRONOUS PROGRAMMING
1 HiPEAC January, 2012 Public TASKS, FUTURES AND ASYNCHRONOUS PROGRAMMING TASK-PARALLELISM OpenCL, CUDA, OpenMP (traditionally) and the like are largely data-parallel models Their core unit of parallelism
More informationConcurrent Manipulation of Dynamic Data Structures in OpenCL
Concurrent Manipulation of Dynamic Data Structures in OpenCL Henk Mulder University of Twente P.O. Box 217, 7500AE Enschede The Netherlands h.mulder-1@student.utwente.nl ABSTRACT With the emergence of
More informationAMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview February 2013 Approved for public distribution
AMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION 1 3DMark Overview February 2013 Approved for public distribution 2 3DMark Overview February 2013 Approved for public distribution
More informationHETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE
HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE Haibo Xie, Ph.D. Chief HSA Evangelist AMD China OUTLINE: The Challenges with Computing Today Introducing Heterogeneous System Architecture (HSA)
More informationGestural and Cinematic Interfaces - DX11. David Brebner Unlimited Realities CTO
Gestural and Cinematic Interfaces - DX11 David Brebner Unlimited Realities CTO Gestural and Cinematic Interfaces DX11 Making an emotional connection with users 3 Unlimited Realities / Fingertapps About
More informationCilk Plus: Multicore extensions for C and C++
Cilk Plus: Multicore extensions for C and C++ Matteo Frigo 1 June 6, 2011 1 Some slides courtesy of Prof. Charles E. Leiserson of MIT. Intel R Cilk TM Plus What is it? C/C++ language extensions supporting
More informationGeneric System Calls for GPUs
Generic System Calls for GPUs Ján Veselý*, Arkaprava Basu, Abhishek Bhattacharjee*, Gabriel H. Loh, Mark Oskin, Steven K. Reinhardt *Rutgers University, Indian Institute of Science, Advanced Micro Devices
More informationROCm: An open platform for GPU computing exploration
UCX-ROCm: ROCm Integration into UCX {Khaled Hamidouche, Brad Benton}@AMD Research ROCm: An open platform for GPU computing exploration 1 JUNE, 2018 ISC ROCm Software Platform An Open Source foundation
More informationAMD APU and Processor Comparisons. AMD Client Desktop Feb 2013 AMD
AMD APU and Processor Comparisons AMD Client Desktop Feb 2013 AMD SUMMARY 3DMark released Feb 4, 2013 Contains DirectX 9, DirectX 10, and DirectX 11 tests AMD s current product stack features DirectX 11
More informationSequential Consistency for Heterogeneous-Race-Free
Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K. REINHARDT, DAVID A. WOOD JUNE 12, 2013 EXECUTIVE
More informationFusion Enabled Image Processing
Fusion Enabled Image Processing I Jui (Ray) Sung, Mattieu Delahaye, Isaac Gelado, Curtis Davis MCW Strengths Complete Tools Port, Explore, Analyze, Tune Training World class R&D team Leading Algorithms
More informationCAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to
CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to AMD s positioning in the datacenter market; expected
More informationRegMutex: Inter-Warp GPU Register Time-Sharing
RegMutex: Inter-Warp GPU Register Time-Sharing Farzad Khorasani* Hodjat Asghari Esfeden Amin Farmahini-Farahani Nuwan Jayasena Vivek Sarkar *farkhor@gatech.edu The 45 th International Symposium on Computer
More informationAMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview April 2013 Approved for public distribution
AMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION 1 3DMark Overview April 2013 Approved for public distribution 2 3DMark Overview April 2013 Approved for public distribution
More informationTHE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD
THE PROGRAMMER S GUIDE TO THE APU GALAXY Phil Rogers, Corporate Fellow AMD THE OPPORTUNITY WE ARE SEIZING Make the unprecedented processing capability of the APU as accessible to programmers as the CPU
More informationAMD S X86 OPEN64 COMPILER. Michael Lai AMD
AMD S X86 OPEN64 COMPILER Michael Lai AMD CONTENTS Brief History AMD and Open64 Compiler Overview Major Components of Compiler Important Optimizations Recent Releases Performance Applications and Libraries
More informationCAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018
CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to AMD s positioning in the datacenter market; expected
More informationHPG 2011 HIGH PERFORMANCE GRAPHICS HOT 3D
HPG 2011 HIGH PERFORMANCE GRAPHICS HOT 3D AMD GRAPHIC CORE NEXT Low Power High Performance Graphics & Parallel Compute Michael Mantor AMD Senior Fellow Architect Michael.mantor@amd.com Mike Houston AMD
More informationCAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to