2 Malloc in OpenCL kernels: why and how?

Roy Spliet, BSc. (MSc. student, Delft University of Technology)
Dr. A.L. Varbanescu, Prof. Dr. Ir. H.J. Sips (Delft University of Technology)
Dr. B.R. Gaster, Dr. L.W. Howes (Advanced Micro Devices)

3 Why? Environment

4 Why? Environment
- Thousands of work-items
  - Maintaining a consistent heap state is difficult
- Single Instruction, Multiple Threads
  - Avoid divergent branching
  - Assume work-items are at the exact same instruction of the program
- Goal is problem solving, not communication
  - Speed!
- Hardware limitations

Malloc in OpenCL Kernels: why and how? June 12th, 2012

5 Why? Environment
- OpenCL kernels cannot request memory from the host
  - Memory is managed in the device driver
  - Only option: allocate in advance
  - A round trip to the host would be expensive
- What happens on the device, stays on the device
  - Pointers are not valid on the host system
  - Context gets lost on transfer
  - (Pre-allocated space: work with offsets instead)
- Speed
  - Allocating on the device means allocating on the host, then on the device
  - Allocating on the host is faster
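The remark about working with offsets in pre-allocated space can be illustrated with a small host-side C sketch. It assumes a pool that is allocated in advance and moved between host and device as one buffer; records refer to each other by byte offset from the pool base, so the links stay meaningful when the buffer ends up at a different address. The names `pool_t`, `pool_alloc`, `pool_at`, `list_push` and `list_sum` are hypothetical and do not come from the presentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_NONE UINT32_MAX   /* hypothetical "null" offset */

typedef struct {
    unsigned char *base;   /* start of the pre-allocated buffer */
    uint32_t size;         /* capacity in bytes */
    uint32_t used;         /* bump pointer, itself an offset */
} pool_t;

/* A list node that links by offset instead of by pointer. */
typedef struct {
    int32_t value;
    uint32_t next;         /* offset of the next node, or POOL_NONE */
} node_t;

/* Reserve n bytes (rounded up to 8) and return their offset. */
static uint32_t pool_alloc(pool_t *p, uint32_t n)
{
    if (n > p->size)
        return POOL_NONE;
    uint32_t rounded = (n + 7u) & ~7u;
    if (rounded > p->size - p->used)
        return POOL_NONE;
    uint32_t off = p->used;
    p->used += rounded;
    return off;
}

/* Translate an offset into an address relative to the given base. */
static void *pool_at(unsigned char *base, uint32_t off)
{
    assert(base != NULL && off != POOL_NONE);
    return base + off;
}

/* Push a value to the front of an offset-linked list; returns the new head. */
static uint32_t list_push(pool_t *p, uint32_t head, int32_t value)
{
    uint32_t off = pool_alloc(p, (uint32_t)sizeof(node_t));
    if (off == POOL_NONE)
        return head;               /* pool exhausted: list unchanged */
    node_t *n = pool_at(p->base, off);
    n->value = value;
    n->next = head;
    return off;
}

/* Sum a list, resolving every link against the supplied base address. */
static int32_t list_sum(unsigned char *base, uint32_t head)
{
    int32_t sum = 0;
    for (uint32_t off = head; off != POOL_NONE; ) {
        node_t *n = pool_at(base, off);
        sum += n->value;
        off = n->next;
    }
    return sum;
}
```

Because only offsets are stored, a byte-for-byte copy of the pool can be traversed with the same head value by passing the copy's base address to `list_sum`.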

6 Why? Environment
- Thousands of work-items
  - Maintaining a consistent heap state is difficult
- Single Instruction, Multiple Threads
  - Avoid divergent branching
  - Assume work-items are at the exact same instruction of the program
- Goal is problem solving, not communication
  - Speed!
- Hardware limitations
  - No GPU->host communication: no solution for memory overestimation
  - Lost context: only use for temporary variables
  - Overhead: avoid using malloc() in-kernel when possible

7 Why? Research

8 Why? Research Set-up
Use-case study: find and investigate a diverse set of parallel programs and categorise their usage of:
- malloc()
  - Call frequency
  - Object size
  - Number of objects
- free()
  - Call frequency
- Allocated memory chunks
  - Global/local (read) access
  - Access pattern

10 Why? Research Set-up

Algorithm Class [1]     | Program           | Source                       | Library
Finite State Machine    | Level-7 filtering | Case, Hellas University      |
Combinatorial           |                   |                              |
Graph Traversal         | Graph Analysis    | Code, TU Delft               | OpenCL
Structured Grid         | Heart Wall        | Code, Rodinia                | OpenMP
Dense Linear Algebra    | K-Means           | Code, Rodinia                | OpenMP
Sparse Matrix           | SPMV              | Code, Parboil                | Cuda
Spectral (FFT)          | FFT               | Code, Parboil                | Cuda
Dynamic Programming     | Dijkstra          | Theory                       |
N-Body/Particle Methods | Barnes-Hut        | Code, Texas State University | OpenCL
MapReduce               |                   |                              |
Backtracking            |                   |                              |
Unstructured Grid       | Back-propagation  | Code, Rodinia                | OpenMP

1. Krste Asanovic et al. A view of the parallel computing landscape. Commun. ACM, 52:56-67, October

11 Why? Research Results
- Finite State Machine, Dynamic programming: arbitrary-sized worklist, scheduling
- Graph Traversal: worklist scheduling, graph
- N-Body simulation: octree, scheduling

13 Why? Research Results
- Process an input character for a single NFA in parallel
  - Non-deterministic Finite Automaton: the next state is not deterministic
- Input "A": 4 states
  - Dynamically sized work queue: use 4 threads to process the next input word in parallel
- Specific case of graph traversal (just like dynamic programming)
- Unknown task list size: overcompensate or malloc()

14 Why? Research Results
Properties of the use-cases:
- Memory is allocated possibly many times, but freed once at the end
- Allocated memory is always an array of equally sized objects
- Each work-item accesses its own memory linearly or at fixed intervals
  - On the GPU's memory bus this corresponds to random access
- But try a generic design
  - Object allocation in C++ kernels

15 Why? Research Results
Other programs:
- Memory chunk size determined on the host
  - Proportional to the input
  - Proportional to the number of threads
  - Data uploaded or downloaded
- Or no global storage at all
  - Local variables are a lot faster

16 Why?

17 Why?
- It's not trivial
- Identified use-cases: demand for more versatile memory management
  - Maintainability of OpenCL kernels
  - Shorter development time
  - Heap and list management optimised together
- Learn more about users' needs
  - Determine whether limitations should be eliminated, based on user feedback
- We write code for memory management, so you don't have to

18 How? Design proposal

19 How? Design proposal
Requirements:
- Use-case study
  - Lists of equal-sized objects
  - Allocate per iteration
  - Free entire lists when done
  - Global access
  - But generic design
- Platform
  - Thread-safe
  - Fast
  - Improve locality, fill up the memory bus

20 How? Design proposal
- ArrayList
  - AddXXX(): improve performance by calling the heap manager as little as possible
    - Prefix-sum reduction to gather memory requirements: O(log p) [2]
  - List of equal-sized objects
  - Item(): global access
  - Clear(): free all at once
- Heap manager
  - Optimised for relatively large chunks
  - Traditional use, no limitations

2. Xiaohuang Huang et al. Xmalloc: A scalable lock-free dynamic memory allocator for many-core machines. In Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on, July
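The idea of gathering memory requirements with a prefix sum can be sketched in plain C. This is a sequential simulation under stated assumptions, not the implementation from the talk: the inner loops stand in for the p work-items of a work-group, each outer iteration stands in for one barrier-separated step, and the names `scan_inclusive`, `gather_requests` and `MAX_ITEMS` are hypothetical. The scan uses the Hillis-Steele scheme, which needs ceil(log2 p) steps.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 256   /* hypothetical upper bound on the work-group size */

/*
 * Inclusive prefix sum in ceil(log2(p)) steps (Hillis-Steele scheme).
 * Each outer iteration is one step that all p work-items would execute
 * in parallel; the inner loops play the role of the work-items.
 * Returns the number of steps taken.
 */
static int scan_inclusive(unsigned *v, size_t p)
{
    unsigned tmp[MAX_ITEMS];
    int steps = 0;
    assert(p <= MAX_ITEMS);
    for (size_t stride = 1; stride < p; stride *= 2) {
        for (size_t i = 0; i < p; i++)
            tmp[i] = (i >= stride) ? v[i] + v[i - stride] : v[i];
        for (size_t i = 0; i < p; i++)
            v[i] = tmp[i];
        steps++;
    }
    return steps;
}

/*
 * Turn per-work-item object counts into per-work-item start indices and
 * return the total, so the heap manager is called once for the group.
 */
static unsigned gather_requests(const unsigned *count, unsigned *start, size_t p)
{
    unsigned incl[MAX_ITEMS];
    assert(p > 0 && p <= MAX_ITEMS);
    for (size_t i = 0; i < p; i++)
        incl[i] = count[i];
    scan_inclusive(incl, p);
    for (size_t i = 0; i < p; i++)
        start[i] = incl[i] - count[i];   /* exclusive prefix sum */
    return incl[p - 1];                  /* total number of objects */
}
```

With the total in hand, a single work-item can request one chunk from the heap manager, and every work-item writes its objects at its own start index inside that chunk.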

21 How? Challenges

22 How? Challenges
Heap management algorithm:
- Platform
  - Thread-safe
  - Scalable
  - Fast
- Efficient memory allocation
  - Low fragmentation
  - High performance
- Use-cases
  - Generic
  - Efficient for use with ArrayList objects, e.g. medium-sized memory chunks

23 How? Challenges
Heap management algorithm: Hoard [3]
- Local heap for each processor (or more) to fight concurrent access
- Small objects: (multi-page) superblocks with equally sized objects
- Large objects: directly served pages
- Not efficient with thousands of threads
  - Low utilisation of malloc() and relatively small objects per thread: superblocks will not fill up
  - Per-thread administration is larger than per-thread memory demand

3. Emery D. Berger et al. Hoard: a scalable memory allocator for multithreaded applications. SIGPLAN Not., 35:117-28, November

24 How? Challenges
Heap management algorithm: DLMalloc [4]
- One heap
- Small objects: pages of equally sized blocks
- Medium objects: best fit from available memory
- Large objects: served directly from the OS
- Free memory blocks are linked together in a doubly linked list, categorised in size buckets, ordered by size
- Serving directly from the OS is not possible
- Best-fit (with coalescing) gives initial problems

4. D. Lea. A memory allocator, October

25 How? Challenges
Heap management algorithm: best-fit
- Blocks of arbitrary sizes, possibly aligned
- Free blocks linked together, categorised in size buckets
- Free blocks as large as possible: coalesce adjacent free blocks
- Split off the desired chunk on allocation
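The best-fit policy with splitting and coalescing can be sketched in single-threaded C. This is a simplified illustration under stated assumptions, not DLMalloc and not the allocator from the talk: block descriptors live in a small table sorted by offset instead of in-heap headers, size buckets and alignment are omitted, and all names (`heap_t`, `bf_init`, `bf_alloc`, `bf_free`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BLOCKS 64   /* hypothetical limit on the descriptor table */

typedef struct { size_t off, size; int is_free; } block_t;

typedef struct {
    block_t blk[MAX_BLOCKS];   /* sorted by offset, covers the whole heap */
    size_t n;
} heap_t;

static void bf_init(heap_t *h, size_t heap_size)
{
    h->blk[0] = (block_t){ 0, heap_size, 1 };
    h->n = 1;
}

/* Returns the offset of the chunk, or (size_t)-1 when nothing fits. */
static size_t bf_alloc(heap_t *h, size_t size)
{
    size_t best = h->n;
    if (size == 0)
        return (size_t)-1;
    /* Best fit: the smallest free block that is still large enough. */
    for (size_t i = 0; i < h->n; i++)
        if (h->blk[i].is_free && h->blk[i].size >= size &&
            (best == h->n || h->blk[i].size < h->blk[best].size))
            best = i;
    if (best == h->n)
        return (size_t)-1;
    /* Split: keep the front part, insert the remainder as a free block.
     * If the table is full, the whole block is handed out unsplit. */
    if (h->blk[best].size > size && h->n < MAX_BLOCKS) {
        memmove(&h->blk[best + 2], &h->blk[best + 1],
                (h->n - best - 1) * sizeof(block_t));
        h->blk[best + 1] = (block_t){ h->blk[best].off + size,
                                      h->blk[best].size - size, 1 };
        h->blk[best].size = size;
        h->n++;
    }
    h->blk[best].is_free = 0;
    return h->blk[best].off;
}

/* Merge descriptor i+1 into descriptor i and close the gap in the table. */
static void merge_with_next(heap_t *h, size_t i)
{
    h->blk[i].size += h->blk[i + 1].size;
    memmove(&h->blk[i + 1], &h->blk[i + 2],
            (h->n - i - 2) * sizeof(block_t));
    h->n--;
}

/* Free the block starting at off and coalesce with free neighbours. */
static void bf_free(heap_t *h, size_t off)
{
    for (size_t i = 0; i < h->n; i++) {
        if (h->blk[i].off != off || h->blk[i].is_free)
            continue;
        h->blk[i].is_free = 1;
        if (i + 1 < h->n && h->blk[i + 1].is_free)
            merge_with_next(h, i);
        if (i > 0 && h->blk[i - 1].is_free)
            merge_with_next(h, i - 1);
        return;
    }
}
```

Coalescing on every free keeps free blocks as large as possible, so after all chunks are returned the table collapses back to a single free block covering the heap.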

27 How? Challenges
Heap management algorithm: DLMalloc [4]
- One heap
- Small objects: pages of equally sized blocks
- Medium objects: best fit from available memory
- Large objects: served directly from the OS
- Free memory blocks are linked together in a doubly linked list, categorised in size buckets, ordered by size
- Serving directly from the OS is not possible
- Best-fit (with coalescing) gives initial problems
  - One large block available, many threads each requiring a small chunk
  - One at a time, each thread takes the entire block, splits off its own chunk and frees the rest for the next thread

4. D. Lea. A memory allocator, October

28 How? Challenges
Locking or lock-free
atom_cmpxchg: atomically compare, exchange if equal

    /* Semantics of atom_cmpxchg; the whole body executes as one atomic step */
    atom_cmpxchg(p, a, b) {
        old = *p;
        if (old == a)
            *p = b;
        return old;
    }
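A typical use of compare-and-exchange is a retry loop. The following host-side C11 sketch pushes nodes onto a singly linked list without a lock; it is an illustration only, and the names `lf_list_t`, `lf_push` and `lf_take_all` are hypothetical. Note one difference from atom_cmpxchg as described on the slide: C11's `atomic_compare_exchange_weak` returns a success flag and writes the value it observed back into its second argument, rather than returning the old value.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    struct node *next;
    int value;
} node_t;

typedef struct {
    _Atomic(node_t *) head;
} lf_list_t;

/*
 * Lock-free push: read the head, link the new node in front of it, and
 * publish the node only if the head is still the value that was read.
 * On failure the exchange stores the current head into 'old', so the
 * loop retries with fresh data.
 */
static void lf_push(lf_list_t *l, node_t *n)
{
    node_t *old = atomic_load(&l->head);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&l->head, &old, n));
}

/*
 * Detach the whole list in one atomic exchange. Taking everything at
 * once avoids the ABA problem that a node-by-node lock-free pop has.
 */
static node_t *lf_take_all(lf_list_t *l)
{
    return atomic_exchange(&l->head, (node_t *)NULL);
}
```

Every failed exchange costs another iteration, which is why heavy contention on a single location makes this pattern expensive.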

29 How? Challenges
Locking or lock-free
- Locking:
  - Does not scale well: performance is linear in the number of threads
  - Is straightforward to implement
- Lock-free:
  - Scales better on multi-core CPUs
  - But not necessarily on GPUs: the CMPXCHG instruction fails a lot when executed in SIMD-like work-groups
  - And is complex to implement

30 How? Challenges
Locking: spin-lock

    do {
        lockval = atom_cmpxchg(&lock, 0, 1);
    } while (lockval != 0);
    /* Critical section */
    atom_xchg(&lock, 0);

32 How? Challenges
Locking: spin-lock

    while (true) {
        if (atom_cmpxchg(&lock, 0, 1) == 0) {
            /* critical section */
            atom_xchg(&lock, 0);   /* release inside the branch that acquired the lock */
            break;
        }
    }

34 How? Challenges
Lock-free doubly linked list algorithm
- Unlink: O(p * log p), with p the number of processors
  - Only if all to-be-freed blocks are adjacent and scheduling is least efficient
- Link: O(n), with n the number of free blocks
  - Repeat when failed, which is highly unlikely

35 How? Challenges
Lock-free unlink
- Mark the node as deleted
- While there is a next to-be-deleted node pointing at you:
  - Make it point at your previous neighbour instead (cmpxchg)
- While there is no next to-be-deleted node pointing at you:
  - If your previous node is not to be deleted, cmpxchg its pointer to your right neighbour instead
  - If that succeeds, the block is no longer being linked to: unlinked!

39 How? Challenges
Lock-free link
- Find the right position, just before a regular (not to-be-deleted) node
- Copy previous and next
- Alter the previous node's next-pointer to point to your node (cmpxchg)
- If that succeeded, alter the next node's previous-pointer as well and mark yourself free

41 How? Challenges
ArrayList
- Straightforward
- Concurrency guaranteed by the flexible memory allocator
- Singly linked list of memory blocks
- Synchronised allocation by prefix-sum
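An ArrayList built as a singly linked list of memory blocks could look roughly like the single-threaded C sketch below. It is an assumption-laden illustration, not the prototype from the talk: the standard `malloc`/`free` stand in for the in-kernel heap manager, one block is allocated per add call (for example with the total produced by a prefix sum), and the names `arraylist_t`, `al_init`, `al_add`, `al_item` and `al_clear` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct al_block {
    struct al_block *next;
    size_t first;                                /* index of the first object */
    size_t count;                                /* objects in this block */
    _Alignas(max_align_t) unsigned char data[];  /* object storage */
} al_block_t;

typedef struct {
    size_t obj_size;          /* all objects in a list have the same size */
    size_t length;            /* total number of objects */
    al_block_t *head, *tail;
} arraylist_t;

static void al_init(arraylist_t *l, size_t obj_size)
{
    assert(obj_size > 0);
    l->obj_size = obj_size;
    l->length = 0;
    l->head = l->tail = NULL;
}

/* Append room for n objects with one heap call; returns the index of
 * the first new object, or (size_t)-1 on failure. */
static size_t al_add(arraylist_t *l, size_t n)
{
    if (n == 0)
        return (size_t)-1;
    al_block_t *b = malloc(sizeof *b + n * l->obj_size);
    if (b == NULL)
        return (size_t)-1;
    b->next = NULL;
    b->first = l->length;
    b->count = n;
    if (l->tail)
        l->tail->next = b;
    else
        l->head = b;
    l->tail = b;
    l->length += n;
    return b->first;
}

/* Global access by index: walk the blocks until the index is covered. */
static void *al_item(const arraylist_t *l, size_t i)
{
    for (al_block_t *b = l->head; b != NULL; b = b->next)
        if (i < b->first + b->count)
            return b->data + (i - b->first) * l->obj_size;
    return NULL;
}

/* Free every block at once. */
static void al_clear(arraylist_t *l)
{
    al_block_t *b = l->head;
    while (b != NULL) {
        al_block_t *next = b->next;
        free(b);
        b = next;
    }
    l->head = l->tail = NULL;
    l->length = 0;
}
```

Lookup cost grows with the number of blocks rather than the number of objects, which is one reason to keep the number of heap calls, and therefore blocks, small.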

42 How?

43 Conclusion
- Identified use-cases: demand for versatile memory management
  - Improving maintainability of OpenCL kernels
  - Improving development time
  - Optimise together instead of trying to re-invent the wheel
- Proposal:
  - Semi-traditional memory allocator: DLMalloc without OS interaction (fixed heap size)
  - Lock-free implementation
  - ArrayLists to optimise use of it

44 Achievements
Proof of concept:
- Sort-of working implementation of lock-free malloc
  - Lock-free DLL algorithm is not valid
  - Perhaps resort to a different back-end (SLL-based?) and take the performance penalty
  - Or investigate locking possibilities when extending OpenCL
- Working implementation of ArrayList using malloc
  - Required 6-8 hours of development time

45 Future work
- Turn the prototype into a library (requires OpenCL 1.2 linking capability)
- Research acceptance and shortcomings based on user response
- Impact of HSA on current hardware limitations
- Optimise algorithms
- For me:
  - Find an implementable thread-safe heap manager
  - Finish the prototype
  - Benchmark: measure scalability
  - Measure the impact of (local/global) prefix-sum array-lists

46 Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.
