On-the-Fly Elimination of Dynamic Irregularities for GPU Computing


On-the-Fly Elimination of Dynamic Irregularities for GPU Computing. Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen

Graphics Processing Units (GPUs) 2

Graphics Processing Unit (GPU): great for regular computation! Threads execute in SIMD groups (warps). Massive parallelism; favorable computing power, cost effectiveness, and energy efficiency. 3

Dynamic Irregularities. Memory: irregular references such as ... = A[P[tid]];. With P[ ] = {0, 1, 2, 3, 4, 5, 6, 7}, a warp's accesses fall into few memory segments; with P[ ] = {0, 5, 1, 7, 4, 3, 6, 2}, they scatter across segments. Control flow (thread divergence): a condition or trip count depends on the data, e.g., if (A[tid]) {...} and for (i = 0; i < A[tid]; i++) {...}. Either irregularity can degrade throughput by up to (warp size - 1) times (warp size = 32 in modern GPUs). 4
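The slide's examples use 8 threads over 4-element memory segments. The throughput loss can be illustrated with a minimal host-side Python sketch of a simplified coalescing model: one memory transaction per distinct segment touched by a SIMD group. The 4-thread group size and the one-transaction-per-segment rule are illustrative assumptions, not the paper's exact hardware model.

```python
# Simplified coalescing model: count one transaction per distinct memory
# segment touched by each SIMD group of threads.

def mem_transactions(indices, warp_size=4, seg_size=4):
    """Count memory transactions for a list of per-thread access indices."""
    total = 0
    for w in range(0, len(indices), warp_size):
        segments = {i // seg_size for i in indices[w:w + warp_size]}
        total += len(segments)
    return total

regular   = [0, 1, 2, 3, 4, 5, 6, 7]   # first P[] on the slide
irregular = [0, 5, 1, 7, 4, 3, 6, 2]   # shuffled P[] on the slide

print(mem_transactions(regular))    # 2: each group stays in one segment
print(mem_transactions(irregular))  # 4: each group straddles two segments
```

The scattered pattern doubles the transaction count even though both patterns touch exactly the same set of elements, which is the waste the transformations below attack.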

Performance Impact. Applications: dynamic programming, fluid simulation, image reconstruction, data mining, etc. [Chart: potential speedup of HMMER, 3D-LBM, CUDA-EC, NN, CFD, CG, and Unwrap; host: Xeon, device: Tesla.] 5

Previous Work. Mostly on static memory irregularities: [Baskaran+, ICS 08], [Lee+, PPoPP 09], [Yang+, PLDI 10], etc. Dynamic irregularities remain unknown until runtime and have been handled mostly through hardware extensions: [Meng+, ISCA 10], [Tarjan+, SC 09], [Fung+, MICRO 07], etc. 6

Open Questions for Dynamic Irregularity Removal. Is it possible to remove dynamic irregularities without hardware extensions? More fundamentally: What are the relations among data, threads, and irregularities? What layouts or thread-data mappings minimize the irregularities, how can they be found, and at what complexity? How should conflicts among irregularities, and dependences, be resolved? 7

Overview of This Work. Analytic findings on the properties of dynamic irregularity removal, plus a software solution: the G-Streamline library. It needs no profiling or hardware extensions, removes irregularities transparently on the fly, jeopardizes no basic efficiency, and treats both types of irregularity holistically. 8

Basic insight: both memory and control irregularities stem from inferior thread-data mappings, which enables a unified treatment. Two basic mechanisms, data relocation and reference redirection (A[P[tid]] -> A[Q[tid]]), compose into three transformations: data reordering, job swapping, and hybrid. 9

Trans-1: Data Reordering (for memory irregularity only). Original: ... = A[P[tid]]; with P[ ] = {0, 5, 2, 3, 2, 3, 7, 6}. Relocation copies the accessed values into a new array A'[ ] at better locations, maintaining the mapping between threads and data values; redirection then rewrites the access as ... = A'[Q[tid]]; with Q[ ] = {0, 1, 2, 3, 2, 3, 6, 7}. (tid: thread ID.) 10
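A minimal host-side sketch of this transformation. The greedy slot assignment used here (each thread prefers the slot equal to its own tid; threads that access the same element share one slot; otherwise the first free slot is taken) is an assumption inferred from the slide's example, not the paper's published algorithm, but it reproduces the Q above.

```python
# Trans-1 sketch: build the relocated array A2 and redirection map Q so that
# A2[Q[tid]] delivers the same value as A[P[tid]] for every thread.

def data_reorder(A, P):
    n = len(P)
    Q = [None] * n
    slot_of = {}            # original index in A -> slot in A2
    used = [False] * n
    for tid, p in enumerate(P):
        if p in slot_of:                      # duplicate access: share the slot
            Q[tid] = slot_of[p]
        else:
            s = tid if not used[tid] else used.index(False)
            slot_of[p] = s
            used[s] = True
            Q[tid] = s
    A2 = list(A)                              # relocation step
    for p, s in slot_of.items():
        A2[s] = A[p]
    return A2, Q

A = list(range(8))
P = [0, 5, 2, 3, 2, 3, 7, 6]
A2, Q = data_reorder(A, P)
print(Q)                                      # [0, 1, 2, 3, 2, 3, 6, 7]
print(all(A2[Q[t]] == A[P[t]] for t in range(8)))   # True: same values delivered
```

The transformed accesses A2[Q[tid]] are contiguous within each SIMD group except for the duplicates, which is exactly the layout the slide shows.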

Trans-2: Job Swapping (for memory). A job = operations + the data elements they access. Original: ... = A[P[tid]]; with P[ ] = {0, 5, 2, 3, 2, 3, 7, 6}. Redirection swaps jobs between threads: newtid = Q[tid]; ... = A[P[newtid]]; with Q[ ] = {0, 4, 2, 3, 1, 5, 6, 7}. Like data reordering, this reduces one memory transaction, still one more than the optimal. 11
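The one-transaction saving can be checked with the same simplified coalescing model as before (4-thread groups, 4-element segments; an illustrative assumption, redefined here so the sketch is self-contained).

```python
# Trans-2 check: applying the slide's Q swaps the jobs of threads 1 and 4,
# saving one memory transaction under the simplified coalescing model.

def mem_transactions(indices, warp_size=4, seg_size=4):
    total = 0
    for w in range(0, len(indices), warp_size):
        total += len({i // seg_size for i in indices[w:w + warp_size]})
    return total

P = [0, 5, 2, 3, 2, 3, 7, 6]
Q = [0, 4, 2, 3, 1, 5, 6, 7]           # job-swap redirection from the slide

before = mem_transactions([P[t] for t in range(8)])
after  = mem_transactions([P[Q[t]] for t in range(8)])
print(before, after)                    # 4 3
```

The swap brings the first group's accesses {0, 2, 2, 3} into a single segment; the second group still straddles two, so the result is one transaction above the optimal of two.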

Trans-2: Job Swapping (for control), method 1. Original: if (B[tid]) {...}. Redirection: newtid = D[tid]; if (B[newtid]) {...}; with D[ ] = {0, 1, 4, 3, 2, 5, 6, 7}. Side effect: the memory reference pattern changes. Solution: a follow-up data reordering. 12

Trans-2: Job Swapping (for control), method 2. Relocation builds B'[ ] with the condition values rearranged, so the code becomes if (B'[tid]) {...}. To preserve job integrity when the condition involves the thread ID, as in if (B[tid]-tid) {...}, redirection is added: newtid = Q[tid]; if (B'[tid]-newtid) {...}; with Q[ ] = {0, 1, 4, 3, 2, 5, 6, 7}. 13
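A minimal sketch of the idea behind the redirection map: stably partition thread IDs by branch outcome so threads in the same SIMD group agree. The predicate values in B and the 4-thread group size below are made-up illustrations, not values from the paper.

```python
# Job swapping for control flow: group threads with equal branch outcomes
# into the same SIMD group to eliminate thread divergence.

def divergent_warps(outcomes, warp_size=4):
    """Count SIMD groups whose threads disagree on the branch outcome."""
    return sum(len(set(outcomes[w:w + warp_size])) > 1
               for w in range(0, len(outcomes), warp_size))

def build_redirection(B):
    # Stable partition of thread IDs: taken branches first, then not-taken.
    return ([t for t in range(len(B)) if B[t]] +
            [t for t in range(len(B)) if not B[t]])

B = [1, 1, 0, 1, 1, 0, 0, 0]            # hypothetical predicate values
D = build_redirection(B)
print(D)                                          # [0, 1, 3, 4, 2, 5, 6, 7]
print(divergent_warps([B[t] for t in range(8)]))  # 2 divergent groups before
print(divergent_warps([B[D[t]] for t in range(8)]))  # 0 after redirection
```

After redirection every group executes a single path, which is the divergence removal the slide illustrates.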

Trans-3: Hybrid (job swap + data reorder, or data reorder + job swap). Either single optimization alone reduces one memory transaction, one more than the optimal. Example (data reorder + job swap): original ... = A[P[tid]]; with P[ ] = {0, 5, 2, 3, 2, 3, 7, 6}. Data reordering (relocation + redirection) yields ... = A'[Q[tid]]; with Q[ ] = {4, 5, 2, 3, 2, 3, 7, 6}. Job swapping (redirection) then yields the transformed code ntid = R[tid]; ... = A'[Q[ntid]]; with R[ ] = {4, 5, 2, 3, 0, 1, 6, 7}. The optimal is achieved. 14
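The slide's hybrid mappings can be checked under the same simplified coalescing model (4-thread groups, 4-element segments; an illustrative assumption): composing Q and R reaches the optimal two transactions.

```python
# Trans-3 check: data reordering (Q, into relocated array A') composed with
# job swapping (R) gives effective access indices Q[R[tid]] into A'.

def mem_transactions(indices, warp_size=4, seg_size=4):
    total = 0
    for w in range(0, len(indices), warp_size):
        total += len({i // seg_size for i in indices[w:w + warp_size]})
    return total

P = [0, 5, 2, 3, 2, 3, 7, 6]   # original indices into A
Q = [4, 5, 2, 3, 2, 3, 7, 6]   # after data reordering: ... = A'[Q[tid]]
R = [4, 5, 2, 3, 0, 1, 6, 7]   # after job swapping:   ntid = R[tid]

print(mem_transactions(P))                            # 4 in the original
print(mem_transactions([Q[R[t]] for t in range(8)]))  # 2: optimal achieved
```

Each SIMD group now touches exactly one segment ({2, 3, 2, 3} and {4, 5, 7, 6}), so no single-transformation residue remains.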

Comparisons. Memory irregularity: data reordering and job swapping differ in applicability; the hybrid has the largest potential. Control irregularity: job swapping by redirection has lower overhead but side effects; job swapping by relocation has higher overhead but no side effects. 15

G-Streamline consists of three parts: transformations (data reordering, job swapping, and hybrid, built on data relocation and reference redirection); optimality and approximation (a complexity analysis that guides the transformations, plus methods for approximating optimal layouts and mappings); and efficiency control (overhead hiding and minimization through adaptive CPU-GPU pipelining and efficiency-driven adaptation). How to determine optimal layouts and thread-data mappings? Finding an optimal data layout is NP-complete (reducible from 3D matching), as is finding an optimal thread-data mapping (the partition problem); G-Streamline approximates them with duplication/packing and clustered sorting. 16

After Transformation. Benchmark suites: Rodinia, Tesla Bio, and others. [Chart: speedup of HMMER, 3D-LBM, CUDA-EC, NN, CFD, CG, and Unwrap with and without transformation overhead; host: Xeon, device: Tesla.] With overhead counted, the question becomes: how to minimize or hide the overhead? 17

G-Streamline efficiency control: overhead hiding and minimization through adaptive CPU-GPU pipelining (kernel splitting, partial transformation and overlapping) and adaptive control. The goals: transparent, on-the-fly operation; adaptation to pattern changes; no performance degradation; resilience to dependences; and automatic balancing of benefits and overhead. 18

CPU-GPU Pipelining: utilize idle CPU time by transforming on the CPU while computing on the GPU, with automatic shutdown when necessary.

    for i = 1:n
        async_transform(i+2);
        async_copy(i+2);
        gpu_kernel(i);
    end

[Timeline: while the GPU runs kernel i, the CPU transforms a later chunk and the memory bus copies it to the GPU.] 19

Dependence or No? Loop in CFD (grid Euler solver):

    for i = 1:iterations
        ...
        cuda_compute_flux(...);  // write A, read B
        cuda_time_step(...);     // read A, write B
        ...
    end

CUDA-EC (DNA error correction):

    main() {
        gpu_fix_errors1();
    }

20

Kernel Splitting. Original launch:

    gpukernel_org<<<...>>>(pdata, ...);

Split and pipeline:

    gpukernel_org_sub<<<...>>>(pdata, 0, (1-r)*len, ...);
    gpukernel_opt_sub<<<...>>>(pdata, (1-r)*len+1, len, ...);

Also useful for loops with no dependences; enables partial transformation. 21
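A host-side Python sketch of the splitting idea: the first (1-r) fraction of the threads runs the original sub-kernel, the rest runs the optimized one, and the combined output must match an unsplit run. The kernel body (gather then double), the names org_sub/opt_sub, and the use of data reordering as the optimized version are illustrative assumptions, not the paper's code.

```python
# Kernel splitting sketch: partial transformation must not change results.

A  = list(range(100, 108))
P  = [0, 5, 2, 3, 2, 3, 7, 6]
Q  = [0, 1, 2, 3, 2, 3, 6, 7]                            # Trans-1 redirection
A2 = [A[0], A[5], A[2], A[3], A[4], A[5], A[7], A[6]]    # relocated copy of A

def org_sub(lo, hi):
    # Original access pattern for thread IDs [lo, hi).
    return {t: 2 * A[P[t]] for t in range(lo, hi)}

def opt_sub(lo, hi):
    # Transformed access pattern (data reordering) for thread IDs [lo, hi).
    return {t: 2 * A2[Q[t]] for t in range(lo, hi)}

n, r = 8, 0.5                       # r: fraction handled by the optimized part
split = org_sub(0, int((1 - r) * n))
split.update(opt_sub(int((1 - r) * n), n))
print(split == org_sub(0, n))       # True: splitting preserves every result
```

Because data reordering preserves each thread's job (A2[Q[t]] == A[P[t]]), any split point is safe; for job swapping, the swap mapping must stay closed within the transformed portion.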

Together, CPU-GPU pipelining, kernel splitting, partial transformation and overlapping, and two-level efficiency-driven adaptation provide overhead hiding and minimization: transparent, on-the-fly, adaptive to pattern changes, with no performance degradation, resilient to dependences, and automatically balancing benefits and overhead. 22

Final Speedup. [Chart: speedup of HMMER, 3D-LBM, CUDA-EC, NN, CFD, CG, and Unwrap for the basic transformation, with efficiency control, and at full potential.] 23

HW Profiling Results.

    Benchmark   Irreg. Type   Mem. Trans. Reduction   Div. Branch Reduction
    3D-LBM      div & mem     10%                     99%
    CFD         mem           12%                     -
    CG          mem           19%                     -
    NN          mem           67%                     -
    Unwrap      mem           19%                     -
    CUDA-EC     div           -                       10%
    GPU-HMMER   div           -                       99%

24

Conclusions. Comprehensive understanding: the computational complexity of dynamic irregularity removal and the relations among threads, data, and irregularities. G-Streamline: a unified software solution that treats both memory and control irregularities, needs no offline profiling or hardware extensions, and works as a whole-system synergy. 25

Thanks! 26

Adaptive Efficiency Control: automatic shutdown and early termination if necessary. A monitor plus scheduler provide whole-system communication and status: GPU status is tracked through cudaStreamQuery(), and the CPU, GPU, and data transfer are coordinated. [Diagram: the monitor and scheduler coordinate transformation tasks on the CPU with computation tasks on the GPU over the memory bus.] 27

Adaptive Efficiency Control: fine-grained adaptation to determine a good splitting ratio. 1. Transparent online profiling in the first several iterations. 2. Build a linear model to obtain the optimal ratio Ropt. 3. Continue adjusting Ropt through a runtime control system: fast convergence to near optimal, avoiding unnecessary oscillations and avoiding being misled by early shutdown. Please see the paper. 28


More information

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Yunsup Lee UC Berkeley 1

Yunsup Lee UC Berkeley 1 Yunsup Lee UC Berkeley 1 Why is Supporting Control Flow Challenging in Data-Parallel Architectures? for (i=0; i

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Fast Tridiagonal Solvers on GPU

Fast Tridiagonal Solvers on GPU Fast Tridiagonal Solvers on GPU Yao Zhang John Owens UC Davis Jonathan Cohen NVIDIA GPU Technology Conference 2009 Outline Introduction Algorithms Design algorithms for GPU architecture Performance Bottleneck-based

More information

Workloads Programmierung Paralleler und Verteilter Systeme (PPV)

Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia

More information

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as

More information

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters Auto-Generation and Auto-Tuning of 3D Stencil s on GPU Clusters Yongpeng Zhang, Frank Mueller North Carolina State University CGO 2012 Outline Motivation DSL front-end and Benchmarks Framework Experimental

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular

More information

Software and Performance Engineering for numerical codes on GPU clusters

Software and Performance Engineering for numerical codes on GPU clusters Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3

More information

Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs

Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs Presented at the 2014 ANSYS Regional Conference- Detroit, June 5, 2014 Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs Bhushan Desam, Ph.D. NVIDIA Corporation 1 NVIDIA Enterprise

More information

One Stone Two Birds: Synchronization Relaxation and Redundancy Removal in GPU-CPU Translation

One Stone Two Birds: Synchronization Relaxation and Redundancy Removal in GPU-CPU Translation One Stone Two Birds: Synchronization Relaxation and Redundancy Removal in GPU-CPU Translation Ziyu Guo Qualcomm CDMA Technologies, San Diego, CA, USA guoziyu@qualcomm.com Bo Wu Computer Science Department

More information

Spring Prof. Hyesoon Kim

Spring Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim 2 Warp is the basic unit of execution A group of threads (e.g. 32 threads for the Tesla GPU architecture) Warp Execution Inst 1 Inst 2 Inst 3 Sources ready T T T T One warp

More information

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs C.-C. Su a, C.-W. Hsieh b, M. R. Smith b, M. C. Jermy c and J.-S. Wu a a Department of Mechanical Engineering, National Chiao Tung

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Heterogeneous platforms

Heterogeneous platforms Heterogeneous platforms Systems combining main processors and accelerators e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC Any platform using a GPU is a heterogeneous platform! Further in this talk

More information

Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?

Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors? Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors? Yunlian Jiang Eddy Z. Zhang Kai Tian Xipeng Shen (presenter) Department of Computer Science, VA, USA Cache Sharing A common

More information

Peta-Scale Simulations with the HPC Software Framework walberla:

Peta-Scale Simulations with the HPC Software Framework walberla: Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method SIAM PP 2016, Paris April 15, 2016 Florian Schornbaum, Christian Godenschwager,

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

Fast Segmented Sort on GPUs

Fast Segmented Sort on GPUs Fast Segmented Sort on GPUs Kaixi Hou, Weifeng Liu, Hao Wang, Wu-chun Feng {kaixihou, hwang121, wfeng}@vt.edu weifeng.liu@nbi.ku.dk Segmented Sort (SegSort) Perform a segment-by-segment sort on a given

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

A MULTI-GPU COMPUTE SOLUTION FOR OPTIMIZED GENOMIC SELECTION ANALYSIS. A Thesis. presented to. the Faculty of California Polytechnic State University

A MULTI-GPU COMPUTE SOLUTION FOR OPTIMIZED GENOMIC SELECTION ANALYSIS. A Thesis. presented to. the Faculty of California Polytechnic State University A MULTI-GPU COMPUTE SOLUTION FOR OPTIMIZED GENOMIC SELECTION ANALYSIS A Thesis presented to the Faculty of California Polytechnic State University San Luis Obispo In Partial Fulfillment of the Requirements

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

CUB. collective software primitives. Duane Merrill. NVIDIA Research

CUB. collective software primitives. Duane Merrill. NVIDIA Research CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives

More information

Introduction to GPU computing

Introduction to GPU computing Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU

More information

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010 Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures

More information

Web Physics: A Hardware Accelerated Physics Engine for Web- Based Applications

Web Physics: A Hardware Accelerated Physics Engine for Web- Based Applications Web Physics: A Hardware Accelerated Physics Engine for Web- Based Applications Tasneem Brutch, Bo Li, Guodong Rong, Yi Shen, Chang Shu Samsung Research America-Silicon Valley {t.brutch, robert.li, g.rong,

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX

Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX Wen-mei Hwu with David Kirk, Shane Ryoo, Christopher Rodrigues, John Stratton, Kuangwei Huang Overview

More information

Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC

Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC DLR.de Chart 1 Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC Melven Röhrig-Zöllner DLR, Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU)

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Breadth-first Search in CUDA

Breadth-first Search in CUDA Assignment 3, Problem 3, Practical Parallel Computing Course Breadth-first Search in CUDA Matthias Springer Tokyo Institute of Technology matthias.springer@acm.org Abstract This work presents a number

More information

Parallel Implementation of VLSI Gate Placement in CUDA

Parallel Implementation of VLSI Gate Placement in CUDA ME 759: Project Report Parallel Implementation of VLSI Gate Placement in CUDA Movers and Placers Kai Zhao Snehal Mhatre December 21, 2015 1 Table of Contents 1. Introduction...... 3 2. Problem Formulation...

More information

Optimizing Parallel Reduction in CUDA

Optimizing Parallel Reduction in CUDA Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf Parallel Reduction Tree-based approach used within each

More information

A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP

A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP Kai Tian Kai Tian, Yunlian Jiang and Xipeng Shen Computer Science Department, College of William and Mary, Virginia, USA 5/18/2009 Cache

More information

14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs

14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs 14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs K. Esler, D. Dembeck, K. Mukundakrishnan, V. Natoli, J. Shumway and Y. Zhang Stone Ridge Technology, Bel Air, MD

More information

State of Art and Project Proposals Intensive Computation

State of Art and Project Proposals Intensive Computation State of Art and Project Proposals Intensive Computation Annalisa Massini - 2015/2016 Today s lecture Project proposals on the following topics: Sparse Matrix- Vector Multiplication Tridiagonal Solvers

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA

Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei H. Hwu

More information