A cache-aware performance prediction framework for GPGPU computations


1 A cache-aware performance prediction framework for GPGPU computations
The 8th Workshop on UnConventional High Performance Computing 2015
Alexander Pöppl, Alexander Herz
August 24th, 2015
UCHPC 2015, August 24th, 2015

2 Agenda
Introduction: Motivation, Contributions, Example
Model: Execution Time Computation, Memory Transfer, Empty Kernels, Workgroup Size, Basic Operations, Memory accesses
Evaluation: Qualitative Evaluation, Quantitative Evaluation
Further Work


4 Introduction Motivation
OpenCL is used for running heterogeneous HPC applications.
It is low-level, fairly explicit, and requires manual task management.
Hence, runtime systems with schedulers, such as StarPU [1] or Harmony [2], have been developed.
These schedule tasks onto heterogeneous hardware based on expected runtime.
High-quality estimates are crucial for efficient schedules.
[1] Cédric Augonnet et al. "StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures." In: Euro-Par 2009 Parallel Processing. Ed. by Henk Sips, Dick Epema, and Hai-Xiang Lin. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2009.
[2] Gregory F. Diamos and Sudhakar Yalamanchili. "Harmony: An Execution Model and Runtime for Heterogeneous Many Core Systems." In: Proceedings of the 17th International Symposium on High Performance Distributed Computing. HPDC '08. Boston, MA, USA: ACM, 2008.

5 Introduction Motivation
Performance prediction models already exist and work well for earlier GPU architectures.
The introduction of caches complicates predictions.
The GPU memory hierarchy needs to be considered.

6 Introduction Contributions
Categorization of memory accesses into classes with distinct performance characteristics.
Fully static OpenCL computation prediction model.
Evaluation using randomly generated OpenCL kernels shows that a cache-aware model improves predictions.

7 Introduction Example
Popular operation: stencil operations.
Array of size $n \times m$.
$b(i, j) = a(i, j)^2 \circ a(1, j)$, where $\circ$ stands for a binary operator that is not legible in the extracted slide text.

8 Introduction Example
    n_WI = m * n
    mem_input_GPU = device.alloc(n_WI * s_WI)
    mem_output_GPU = device.alloc(n_WI * s_WI)
    copydatatogpu(mem_input_GPU)
    device.kernel(n_WI, n_WG, m, n): for all id in {0, .., n_WI}. sq_mod(mem_input_GPU, mem_output_GPU, m, n)
    copydatafromgpu(mem_output_GPU)

9 Introduction Example
    kernel void sq_mod(global float* matrix, global float* res,
                       unsigned int m, unsigned int n) {
        /* One work-item per matrix element. */
        size_t current_pos = get_global_id(0);
        unsigned int current_row = current_pos / n;
        unsigned int current_col = current_pos % n;
        /* Square of the element, combined with the first-row element of the same column. */
        res[current_pos] = matrix[current_row * n + current_col]
                         * matrix[current_row * n + current_col]
                         /* binary operator not legible in the source */ matrix[current_col];
    }
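The index arithmetic of the kernel above can be traced on the host. The following Python sketch is only an illustration: the function name `read_indices` and the sample sizes are hypothetical, and it mirrors the integer arithmetic of the kernel without modelling the GPU in any way.

```python
def read_indices(pos, n):
    """Return the two matrix indices read by work-item `pos` for row width n."""
    row = pos // n          # current_row = current_pos / n
    col = pos % n           # current_col = current_pos % n
    return row * n + col, col


# Trace the first eight work-items for a hypothetical row width of 4.
for pos in range(8):
    own_index, first_row_index = read_indices(pos, 4)
    print(pos, own_index, first_row_index)
```

The first index always equals the work-item id, so neighbouring work-items read neighbouring elements. The second index stays inside the first row and wraps around every n work-items, which plausibly corresponds to the "continuous" and "interval" global reads that appear in the cost table of the qualitative evaluation.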

10 Model Execution Time Computation
Computation of the runtime:
$$t(n_{WI}, s_{WI}, n_{WG}) = t_{Transfer}(n_{WI}, s_{WI}) + t_{Kernel}(n_{WI}, n_{WG})$$
$$t_{Kernel}(n_{WI}, n_{WG}) = t_{Base}(n_{WI}) + \sum_{Op \in \text{Expr.-Types}} W_{Op}(n_{Op}) \, t_{Op}(n_{WI}) \, U(n_{WG}, n_{XU})$$
$n_{WI}$: number of work-items
$n_{WG}$: number of work-items per work-group
$s_{WI}$: size of a work-item in bytes
$n_{XU}$: number of execution units on the GPU
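A minimal sketch of how the terms of the runtime formula compose, assuming that the work-group factor U scales only the sum over operation costs (the grouping is not unambiguous in the slide text). The function names and all numbers are hypothetical; this is not the authors' implementation.

```python
def kernel_time(t_base, op_costs, utilization):
    """t_Kernel = t_Base + U * sum(W_Op * t_Op); the grouping of U is an assumption."""
    return t_base + utilization * sum(w * t for w, t in op_costs)


def total_time(t_transfer, t_base, op_costs, utilization):
    """t = t_Transfer + t_Kernel."""
    return t_transfer + kernel_time(t_base, op_costs, utilization)


# Hypothetical (W_Op, t_Op) pairs, in arbitrary time units.
ops = [(1.0, 2.0), (0.5, 4.0)]
print(total_time(t_transfer=3.0, t_base=1.0, op_costs=ops, utilization=1.5))
```

Keeping the transfer, base, and per-operation terms as separate inputs reflects the structure of the model: each term is measured and fitted independently and only combined at prediction time.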

11 Model Memory Transfer
GPUs have a dedicated portion of memory for their computations.
The time for a memory transfer is governed by two variables:
$bw$: bandwidth
$l_{prop}$: propagation latency

12 Model Memory Transfer
Figure: Memory transfer times, to GPU and from GPU (x-axis: # DWords, scaled by $10^7$; y-axis: time in ms).
$$t_{trans}^{to}(n_{WI}) = bw_{to}^{-1} \, n_{WI} + l_{to}$$
$$t_{trans}^{from}(n_{WI}) = bw_{from}^{-1} \, n_{WI} + l_{from}$$
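The linear transfer model can be fitted from a handful of timing measurements. The sketch below uses ordinary least squares in plain Python; the helper names and the synthetic data are assumptions made for illustration, and the slides do not state which fitting procedure the authors used.

```python
def transfer_time(n_wi, inv_bw, latency):
    """t_trans(n_WI) = bw^-1 * n_WI + l."""
    return inv_bw * n_wi + latency


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic "measurements" generated from known, hypothetical parameters.
sizes = [1_000_000, 2_000_000, 4_000_000, 8_000_000]
times = [transfer_time(s, 2e-6, 0.5) for s in sizes]
inv_bw, latency = fit_line(sizes, times)
print(inv_bw, latency)
```

The slope recovers the inverse bandwidth and the intercept recovers the latency; the same fit would be run once for each transfer direction.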

13 Model Empty Kernels
Figure: Execution times for empty kernels (x-axis: $n_{WI}$; y-axis: time in ms).
$$t_{Base}(n_{WI}) = c_{Base} \, n_{WI} + c_{Base}^{fixed}$$

14 Model Workgroup Size
Figure: Execution time for different work-group sizes: (a) NVidia GT-650M, (b) Intel HD Graphics 4000 (x-axis: work-items per work-group; y-axis: time in ms; series: observed runtime).
The kernel used to evaluate this behavior performs one read from and one write to global memory, and one floating-point division.

15 Model Workgroup Size
Modelling the behavior:
Periodic spikes in execution time, especially visible on the HD 4000.
Influence of the work-group size: $U(n_{WG}, n_{XU})$ is composed of two underbraced terms, A and B, involving $n_{WG}$, $n_{XU}$, and $n_{WG} \bmod n_{XU}$.
[Formula for $U(n_{WG}, n_{XU})$ not recoverable from the extracted slide text.]

16 Model Basic Operations
Figure: Progression of the execution time for basic operations: (a) one operation per work-item (x-axis: $n_{WI}$, scaled by $10^7$; y-axis: time in ms), (b) multiple operations per work-item (x-axis: $n_{Ops}$). Series: arithmetic operations on Float, including / and +.

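17 Model Basic Operations
$$W_{op}^{type}(n_{Ops}) = \begin{cases} a \, n_{Ops}^{b} + c & : n_{Ops} \le n_{Ops}^{sat} \\ a' \, n_{Ops} + c' & : n_{Ops} > n_{Ops}^{sat} \end{cases}$$
$$t_{op}^{type}(n_{WI}) = c_{op}^{type} \, n_{WI}$$
$a$, $a'$, $b$, $c$, $c'$ are obtained by fitting $W_{op}^{type}(n_{Ops})$ to the measurements in panel (b) of the basic-operations figure.
$c_{op}^{type}$ is obtained by fitting $t_{op}^{type}(n_{WI})$ to the measurements in panel (a).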
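The piecewise weight function can be written down directly. In this sketch the parameter names `a_lin` and `c_lin` stand for the second pair of fitted coefficients of the linear branch, and all numeric values are hypothetical; they were chosen only so that the two branches meet at the saturation point.

```python
def op_weight(n_ops, n_sat, a, b, c, a_lin, c_lin):
    """W(n_Ops): power law up to the saturation point n_sat, linear beyond it."""
    if n_ops <= n_sat:
        return a * n_ops ** b + c
    return a_lin * n_ops + c_lin


def op_time(n_wi, c_op):
    """t_op(n_WI) = c_op * n_WI."""
    return c_op * n_wi


# Hypothetical coefficients: sqrt-like growth up to 16 operations, then linear.
params = dict(n_sat=16, a=1.0, b=0.5, c=0.0, a_lin=0.25, c_lin=0.0)
for n_ops in (4, 16, 32):
    print(n_ops, op_weight(n_ops, **params))
```

A fitted parameter set does not have to be continuous at the saturation point; that depends on how the two branches are fitted.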

18 Model Memory accesses
In OpenCL, three different kinds of memory accesses are available:
private: used for local variables and parameters.
local: shared between work-items within a work-group.
global: shared amongst all work-items.
These are usually implemented using different kinds of memory.

19 Model Memory accesses
Figure: Time in ms over $n_{WI}$ (scaled by $10^7$) for Global Read, Global Write, Local Read, Local Write, and Private Access.

20 Model Memory accesses
Coalesced Accesses
Figure: Time in ms over $n_{WI}$ (scaled by $10^7$). Series: Coalesced.

21 Model Memory accesses
Constant Accesses
Figure: Time in ms over $n_{WI}$ (scaled by $10^7$). Series: Coalesced, Constant.

22 Model Memory accesses
Interval Accesses
Figure: Time in ms over $n_{WI}$ (scaled by $10^7$). Series: Coalesced, Interval, Constant.

23 Model Memory accesses
Two Identical Accesses
Figure: Time in ms over $n_{WI}$ (scaled by $10^7$). Series: Coalesced, 2 Identical coalesced, Interval, Constant.

24 Model Memory accesses
Complex Accesses
Figure: Time in ms over $n_{WI}$ (scaled by $10^7$). Series: Complex, Coalesced, 2 Identical coalesced, Interval, Constant.


26 Evaluation Qualitative Evaluation
Static prediction of the execution time given the following data:
Kernel source code
Data about GPU characteristics
Number of work-items $n_{WI}$
Table: cost breakdown with the columns "Cost Type", "# in Kernel", and "Time in µs". Rows: float and int arithmetic operations (including "/ int"), private access, interval global read access, continuous global read access, base cost, work-group size, final prediction. Only fragments of the values survive in the extracted text: "1." for private access and "889" for the final prediction.

27 Evaluation Qualitative Evaluation
Figure: Time in s over the number of elements. Series: Our model, Observation.

28 Evaluation Quantitative Evaluation
Quantitative evaluation through generated OpenCL kernels.
Two sets of kernels: Unrestricted and Realistic.
Unrestricted Set:
Few restrictions on complexity.
Complex memory access patterns are possible.
    ((xxx[((y + x) + 454) & 0x7f] / (matrix[x][y] * x)) - (matrix[x][y] + (matrix[x][y] + ((matrix[(4419 * (2 + x)) % HEIGHT][194 % WIDTH] - xxx[71632 & 0x7f]) - (( f * (x - y)) + ( f / (((((matrix[x][y] - matrix[x][y]) - xxx[(y * x) & 0x7f]) f) + matrix[x][y]) f))))))) + xxx[x & 0x7f]
Realistic Set:
Complexity is restricted: limited number of nodes in the syntax tree.
No overly complex memory access patterns.
    ((x / (xxx[x & 0x7f] / (matrix[1 % HEIGHT][361 % WIDTH] * matrix[x][y]))) * xxx[y & 0x7f]) f
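To give a feel for how a node limit restricts generated expressions, here is a small Python sketch of a random expression generator with an operator budget. It is not the authors' generator: the leaf set (borrowed from the sample expressions above), the operator list, the stopping probability, and the function name `random_expr` are all assumptions.

```python
import random

# Hypothetical leaf operands and binary operators.
LEAVES = ["x", "y", "matrix[x][y]", "xxx[x & 0x7f]", "xxx[y & 0x7f]"]
OPERATORS = ["+", "-", "*", "/"]


def random_expr(rng, max_nodes):
    """Return (expression, operators_used) using at most max_nodes operator nodes."""
    if max_nodes <= 0 or rng.random() < 0.3:
        return rng.choice(LEAVES), 0
    budget = max_nodes - 1                    # one node is spent on this operator
    left_budget = rng.randint(0, budget)      # split the rest between the operands
    left, used_left = random_expr(rng, left_budget)
    right, used_right = random_expr(rng, budget - left_budget)
    op = rng.choice(OPERATORS)
    return f"({left} {op} {right})", 1 + used_left + used_right


rng = random.Random(42)
expr, n_nodes = random_expr(rng, max_nodes=6)
print(n_nodes, expr)
```

A small budget plays the role of the "realistic" restriction, while a large budget and a richer set of index expressions would move towards the "unrestricted" set.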

29 Evaluation Quantitative Evaluation
GT-650M
Figure: $t_{prediction} / t_{result}$ for (a) the Realistic Set and (b) the Unrestricted Set.
$0.7 < t_{prediction} / t_{result} < 1.3$ for 63% of all samples for the restricted (realistic) set.
$0.7 < t_{prediction} / t_{result} < 1.3$ for 5% of all samples for the unrestricted set.
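The accuracy criterion used here is the share of samples whose ratio of predicted to measured time lies between 0.7 and 1.3. A minimal Python sketch of that computation, with a hypothetical function name and made-up sample values:

```python
def fraction_within(predictions, results, low=0.7, high=1.3):
    """Share of samples whose prediction/result ratio lies strictly between low and high."""
    ratios = [p / r for p, r in zip(predictions, results)]
    hits = sum(1 for q in ratios if low < q < high)
    return hits / len(ratios)


# Made-up predicted and measured times, in the same unit.
predicted = [1.0, 2.0, 0.5, 3.0]
measured = [1.1, 1.0, 0.6, 2.9]
print(fraction_within(predicted, measured))  # prints 0.75
```

Only the second sample, with a ratio of 2.0, falls outside the band, so three of the four samples count as accurately predicted.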

30 Evaluation Quantitative Evaluation
Quadro K
Figure: $t_{prediction} / t_{result}$ for (c) the Realistic Set and (d) the Unrestricted Set.
$0.7 < t_{prediction} / t_{result} < 1.3$ for 71% of all samples for the restricted (realistic) set.
$0.7 < t_{prediction} / t_{result} < 1.3$ for 43% of all samples for the unrestricted set.

31 Evaluation Quantitative Evaluation
Comparison
Figure: $t_{prediction} / t_{result}$ for (e) the Cache-Aware Model and (f) the Simple Model.
$0.7 < t_{prediction} / t_{result} < 1.3$ for 71% of all samples for our model.
$0.7 < t_{prediction} / t_{result} < 1.3$ for 61% of all samples for the simpler model.

32 Further Work
Improve predictions and expand to more architectures.
Support more language constructs, e.g. if or for.
Support intrinsic operations, e.g. sin(), sqrt().

33 Thank you for your attention
Acknowledgements: This work was partly supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre Invasive Computing (SFB/TR 89).
