On-the-Fly Elimination of Dynamic Irregularities for GPU Computing
On-the-Fly Elimination of Dynamic Irregularities for GPU Computing
Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen
Graphic Processing Unit (GPU)
Massive parallelism, organized in SIMD groups (warps). Favorable computing power, cost effectiveness, and energy efficiency. Great for regular computation!
Dynamic Irregularities
Memory irregularity: ... = A[P[tid]]; where A[] spans several memory segments. With P[] = {0, 1, 2, 3, 4, 5, 6, 7} the warp's accesses coalesce; with P[] = {0, 5, 1, 7, 4, 3, 6, 2} they scatter across segments and cost extra memory transactions.
Control-flow irregularity (thread divergence): if (A[tid]) {...} or for (i = 0; i < A[tid]; i++) {...}
Either kind can degrade throughput by up to (warp size - 1) times (warp size = 32 in modern GPUs).
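The throughput cost can be made concrete with a toy coalescing model. The sketch below is our simplification, not the paper's exact hardware model: it charges each group of 4 consecutive threads one transaction per distinct 4-word segment the group touches, which reproduces the gap between the two P[] arrays above.

```python
# Simplified coalescing model (an assumption, not the paper's exact model):
# each group of 4 consecutive threads issues one memory transaction per
# distinct 4-word segment it touches.
def mem_transactions(indices, seg_size=4, group_size=4):
    total = 0
    for g in range(0, len(indices), group_size):
        group = indices[g:g + group_size]
        total += len({i // seg_size for i in group})
    return total

coalesced = [0, 1, 2, 3, 4, 5, 6, 7]   # regular P[] from the slide
scattered = [0, 5, 1, 7, 4, 3, 6, 2]   # irregular P[] from the slide

print(mem_transactions(coalesced))  # 2: each thread group stays in one segment
print(mem_transactions(scattered))  # 4: each group straddles two segments
```

Under this model the irregular pattern doubles the traffic even though both warps read the same eight values.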
Performance Impact
Applications affected: dynamic programming, fluid simulation, image reconstruction, data mining, etc. Potential speedups (host: Xeon; device: Tesla) for HMMER, 3D-LBM, CUDA-EC, NN, CFD, CG, and Unwrap.
Previous Work
Mostly on static memory irregularities: [Baskaran+, ICS'08], [Lee+, PPoPP'09], [Yang+, PLDI'10], etc. Dynamic irregularities remain unknown until runtime; prior treatments are mostly hardware extensions: [Meng+, ISCA'10], [Tarjan+, SC'09], [Fung+, MICRO'07], etc.
Open Questions for Dynamic Irregularity Removal
Is it possible to do so without hardware extensions? More fundamentally: What are the relations among data, threads, and irregularities? What layouts or thread-data mappings minimize the irregularities? How to find them, and at what complexity? How to resolve conflicts among irregularities, and dependences?
Overview of this Work
Analytic findings on the properties of dynamic irregularity removal. A software solution, the G-Streamline library: no profiling or hardware extensions; transparent removal on the fly; never jeopardizes basic efficiency; treats both types of irregularity holistically.
Basic Insight
Both memory and control irregularities stem from inferior thread-data mappings. Two basic mechanisms, data relocation and reference redirection (A[P[tid]] -> A'[Q[tid]]), compose three transformations: data reordering, job swapping, and a hybrid of the two. A unified treatment.
Trans-1: Data Reordering (for memory irregularity only)
Original: ... = A[P[tid]]; with P[] = {0,5,2,3,2,3,7,6}. Via <relocation>, create a reordered copy A'[] while maintaining the mapping between threads and data values; via <redirection>, the transformed code becomes ... = A'[Q[tid]]; with Q[] = {0,1,2,3,2,3,6,7}.
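The reordering on the running example can be checked with a small simulation. The relocation layout for A' below is one consistent choice (the slide does not fix where the displaced elements A[1] and A[4] land), and the transaction counter is the same simplified per-4-thread-group, 4-word-segment model assumed earlier.

```python
def mem_transactions(indices, seg_size=4, group_size=4):
    # simplified model: one transaction per distinct segment per thread group
    return sum(len({i // seg_size for i in indices[g:g + group_size]})
               for g in range(0, len(indices), group_size))

A = [10, 11, 12, 13, 14, 15, 16, 17]
P = [0, 5, 2, 3, 2, 3, 7, 6]           # original index array (slide)
Q = [0, 1, 2, 3, 2, 3, 6, 7]           # redirected index array (slide)

# <relocation>: one layout for A' such that A'[Q[t]] == A[P[t]] for every
# thread t; the placement of the displaced A[1], A[4] is our assumption.
relocation = [0, 5, 2, 3, 1, 4, 7, 6]  # A2[i] = A[relocation[i]]
A2 = [A[src] for src in relocation]

assert all(A2[Q[t]] == A[P[t]] for t in range(8))  # each thread sees the same value
print(mem_transactions(P))   # 4 transactions before
print(mem_transactions(Q))   # 3 after: one saved, one more than the optimal 2
```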
Trans-2: Job Swapping (for memory)
A job = operations + the data elements accessed. Original: ... = A[P[tid]]; with P[] = {0,5,2,3,2,3,7,6}. Via <redirection>, the transformed code becomes newtid = Q[tid]; ... = A[P[newtid]]; with Q[] = {0,4,2,3,1,5,6,7}. Both data reordering and job swapping reduce one memory transaction on this example; still one more than the optimal.
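Job swapping reaches the same saving without touching the data layout: threads trade jobs so that each 4-thread group lands in fewer segments. The sketch below uses the slide's Q[] and our simplified transaction counter; it also checks that the swap is a permutation, so every job is still performed exactly once.

```python
def mem_transactions(indices, seg_size=4, group_size=4):
    # simplified model: one transaction per distinct segment per thread group
    return sum(len({i // seg_size for i in indices[g:g + group_size]})
               for g in range(0, len(indices), group_size))

P = [0, 5, 2, 3, 2, 3, 7, 6]
Q = [0, 4, 2, 3, 1, 5, 6, 7]             # newtid = Q[tid] (slide)

effective = [P[Q[t]] for t in range(8)]  # indices the warp actually issues

assert sorted(Q) == list(range(8))       # Q permutes threads: every job still done
assert sorted(effective) == sorted(P)    # same multiset of accesses overall
print(mem_transactions(P))          # 4 before
print(mem_transactions(effective))  # 3 after the swap
```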
Trans-2: Job Swapping (for control), method 1
Original: if (B[tid]) {...}. Via <redirection>: newtid = D[tid]; if (B[newtid]) {...}; with D[] = {0,1,4,3,2,5,6,7}. Side effect: the memory reference pattern changes. Solution: a follow-up data reordering.
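A sketch of why the redirection removes divergence. The concrete values of B[] are not given on the slide, so the ones below are invented so that D[], which swaps threads 2 and 4, makes each 4-thread group branch uniformly; counting a group as divergent when its outcomes are mixed is our simplification of warp divergence.

```python
# Divergence model (an assumption): a thread group diverges when its branch
# outcomes are mixed rather than uniform.
def divergent_groups(outcomes, group_size=4):
    groups = [outcomes[g:g + group_size]
              for g in range(0, len(outcomes), group_size)]
    return sum(1 for grp in groups if len(set(grp)) > 1)

B = [1, 1, 0, 1, 1, 0, 0, 0]   # branch condition per thread (made-up values)
D = [0, 1, 4, 3, 2, 5, 6, 7]   # newtid = D[tid] (slide): swaps threads 2 and 4

before = B
after = [B[D[t]] for t in range(8)]

print(divergent_groups(before))  # 2: both groups mix taken and not-taken
print(divergent_groups(after))   # 0: each group is now uniform
```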
Trans-2: Job Swapping (for control), method 2
Original: if (B[tid]) {...}. Via <relocation>, build B'[] and transform to if (B'[tid]) {...}. To preserve job integrity when the job depends on the thread ID, e.g. if (B[tid]-tid) {...}, redirect as well: newtid = Q[tid]; if (B'[tid]-newtid) {...}; with Q[] = {0,1,4,3,2,5,6,7}.
Trans-3: Hybrid (Data Reorder + Job Swap)
A single optimization reduces one memory transaction on the example, one more than optimal; the hybrid composes both. Original: ... = A[P[tid]]; with P[] = {0,5,2,3,2,3,7,6}. Step 1, data reordering via <relocation> and <redirection>: ... = A'[Q[tid]]; with Q[] = {4,5,2,3,2,3,7,6}. Step 2, job swapping via <redirection>: ntid = R[tid]; ... = A'[Q[ntid]]; with R[] = {4,5,2,3,0,1,6,7}. Optimal achieved.
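Composing the two steps on the example reaches the optimum under our simplified transaction model (one transaction per distinct 4-word segment per 4-thread group); the A' layout is again one consistent relocation choice on our part, since the slide leaves the displaced elements' slots open.

```python
def mem_transactions(indices, seg_size=4, group_size=4):
    # simplified model: one transaction per distinct segment per thread group
    return sum(len({i // seg_size for i in indices[g:g + group_size]})
               for g in range(0, len(indices), group_size))

A = [10, 11, 12, 13, 14, 15, 16, 17]
P = [0, 5, 2, 3, 2, 3, 7, 6]
Q = [4, 5, 2, 3, 2, 3, 7, 6]           # after the data reordering step (slide)
R = [4, 5, 2, 3, 0, 1, 6, 7]           # ntid = R[tid], the follow-up swap (slide)

relocation = [1, 4, 2, 3, 0, 5, 6, 7]  # one layout with A2[Q[t]] == A[P[t]]
A2 = [A[src] for src in relocation]
final = [Q[R[t]] for t in range(8)]    # indices issued after both steps

assert sorted(R) == list(range(8))                  # R permutes: all jobs preserved
assert all(A2[Q[t]] == A[P[t]] for t in range(8))   # reordering keeps values intact
print(mem_transactions(P))      # 4: original access pattern
print(mem_transactions(final))  # 2: the hybrid reaches the optimal
```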
Comparisons
For memory irregularities: data reordering and job swapping differ in applicability; the hybrid has the largest potential. For control irregularities: job swapping by redirection has lower overhead but side effects; job swapping by relocation has higher overhead but no side effects.
G-Streamline
Transformations: data relocation and reference redirection, composed into data reordering, job swapping, and the hybrid, with guidance for choosing among them.
Optimality & approximation: complexity analysis shows the problems are NP-complete (optimal layout, via 3D matching; optimal thread-data mapping, via the partition problem); approximations for finding near-optimal layouts and mappings include duplication/packing and clustered sorting.
Efficiency control: overhead hiding and minimization through adaptive CPU-GPU pipelining and efficiency-driven adaptation.
The central question: how to determine optimal layouts and thread-data mappings?
After Transformation
Benchmark suites: Rodinia, Tesla Bio, etc. Host: Xeon; device: Tesla. Speedups, shown without and with transformation overhead, for HMMER, 3D-LBM, CUDA-EC, NN, CFD, CG, and Unwrap. The remaining question: how to minimize or hide the overhead?
G-Streamline Efficiency Control
Overhead hiding and minimization: CPU-GPU pipelining, kernel splitting, partial transformation and overlapping, and two-level adaptive control. The result is transparent and on-the-fly, adaptive to pattern changes, resilient to dependences, causes no performance degradation, and automatically balances benefits and overhead.
CPU-GPU Pipelining
Utilize idle CPU time: transform on the CPU while computing on the GPU; shut down automatically when necessary.
for i = 1:n
    async_transform(i+2);
    async_copy(i+2);
    gpu_kernel(i);
end
While the GPU runs kernel i, the CPU transforms (cpu_transform) and copies (copy_to_gpu) the data for iteration i+2, keeping the CPU, the memory bus, and the GPU busy concurrently.
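The overlap in the loop above can be sketched on the host side with a worker thread standing in for the asynchronous transform-and-copy. All names here (cpu_transform, gpu_kernel, pipelined) are illustrative stand-ins, not G-Streamline's API, and a toy one-stage lookahead replaces the two-stage lookahead of the slide.

```python
# Minimal pipelining sketch: while the "GPU" consumes chunk i, a worker
# thread prepares chunk i+1 on the "CPU".
from concurrent.futures import ThreadPoolExecutor

def cpu_transform(chunk):          # stand-in for host-side data reordering
    return sorted(chunk)

def gpu_kernel(chunk):             # stand-in for the device computation
    return sum(chunk)

def pipelined(chunks):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(cpu_transform, chunks[0])
        for i in range(len(chunks)):
            ready = pending.result()             # wait for transform + copy
            if i + 1 < len(chunks):              # launch the next transform early
                pending = pool.submit(cpu_transform, chunks[i + 1])
            results.append(gpu_kernel(ready))    # "GPU" work overlaps with it
    return results

print(pipelined([[3, 1], [4, 1, 5], [9, 2]]))  # [4, 10, 11]
```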
Dependence or No Loop
CFD (grid Euler solver), a loop with dependences:
for i = 1:iterations
    ...
    cuda_compute_flux(...);  // write A, read B
    cuda_time_step(...);     // read A, write B
    ...
end
CUDA-EC (DNA error correction), no loop:
main() {
    gpu_fix_errors1();
}
Kernel Splitting
gpukernel_org<<<...>>>(pdata, ...);
splits into a pipeline:
gpukernel_org_sub<<<...>>>(pdata, 0, (1-r)*len, ...);
gpukernel_opt_sub<<<...>>>(pdata, (1-r)*len+1, len, ...);
Also useful for loops with no dependences; enables partial transformation.
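The index arithmetic of the split can be sketched as follows; the ratio r, the length, and the half-open range convention are illustrative assumptions rather than the library's exact interface.

```python
# Split a workload of `length` elements at ratio r: the first (1 - r)
# fraction goes to the original kernel, the rest to the optimized one.
def split_ranges(length, r):
    cut = length - int(r * length)
    return (0, cut), (cut, length)   # [0, cut) original, [cut, length) optimized

length, r = 100, 0.3
(org_lo, org_hi), (opt_lo, opt_hi) = split_ranges(length, r)

# Full coverage, no overlap: the two sub-kernels together touch every element.
assert org_lo == 0 and opt_hi == length and org_hi == opt_lo
print(split_ranges(100, 0.3))  # ((0, 70), (70, 100))
```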
Final Speedup
Speedups for HMMER, 3D-LBM, CUDA-EC, NN, CFD, CG, and Unwrap, shown for the basic transformation, with efficiency control, and at full potential.
HW Profiling Results
Benchmark    Irreg. Type   Mem. Trans. Reduction   Div. Branch Reduction
3D-LBM       div & mem     10%                     99%
CFD          mem           12%                     -
CG           mem           19%                     -
NN           mem           67%                     -
Unwrap       mem           19%                     -
CUDA-EC      div           -                       10%
GPU-HMMER    div           -                       99%
Conclusions
Comprehensive understanding: computational complexity, and the relations among threads, data, and irregularities. G-Streamline, a unified software solution: treats both memory and control irregularities; needs no offline profiling or hardware extensions; a whole-system synergy.
Thanks!
Adaptive Efficiency Control
Automatic shutdown: early termination if necessary. A monitor plus a scheduler handle whole-system communication and status: they track GPU status through cudaStreamQuery() and coordinate the CPU, the GPU, and data transfers between transformation tasks and GPU computation tasks over the memory bus.
Adaptive Efficiency Control: Fine-Grained Adaptation
To determine a good splitting ratio: 1. transparent online profiling in the first several iterations; 2. build a linear model to obtain the optimal ratio Ropt; 3. continue adjusting Ropt through a runtime control system. This converges quickly to near optimal, avoids unnecessary oscillations, and avoids being misled by early shutdown. See the paper for details.
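One way to realize the linear-model step can be sketched under an assumed cost model (ours, not necessarily the paper's): profiling yields per-element times for the original kernel (t_org), the optimized kernel (t_opt), and the overlapped CPU transform (t_cpu), so an iteration at ratio r costs the larger of the GPU side, (1-r)*t_org + r*t_opt, and the hidden CPU side, r*t_cpu.

```python
# Hedged sketch of picking the splitting ratio Ropt from a fitted linear
# cost model; the model and all parameter values are assumptions.
def iteration_time(r, t_org, t_opt, t_cpu):
    gpu_side = (1 - r) * t_org + r * t_opt   # kernel time at split ratio r
    cpu_side = r * t_cpu                     # transform time, overlapped
    return max(gpu_side, cpu_side)           # pipeline is bound by the slower

def best_ratio(t_org, t_opt, t_cpu, steps=1000):
    # grid search over r in [0, 1] for the minimum predicted iteration time
    return min((i / steps for i in range(steps + 1)),
               key=lambda r: iteration_time(r, t_org, t_opt, t_cpu))

r_opt = best_ratio(t_org=4.0, t_opt=1.0, t_cpu=5.0)
print(round(r_opt, 2))  # 0.5: transform half the data, fully hiding its cost
```

At runtime the control system would keep nudging this estimate as newly measured times drift from the fitted model.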
More informationSoftware and Performance Engineering for numerical codes on GPU clusters
Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3
More informationMaximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs
Presented at the 2014 ANSYS Regional Conference- Detroit, June 5, 2014 Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs Bhushan Desam, Ph.D. NVIDIA Corporation 1 NVIDIA Enterprise
More informationOne Stone Two Birds: Synchronization Relaxation and Redundancy Removal in GPU-CPU Translation
One Stone Two Birds: Synchronization Relaxation and Redundancy Removal in GPU-CPU Translation Ziyu Guo Qualcomm CDMA Technologies, San Diego, CA, USA guoziyu@qualcomm.com Bo Wu Computer Science Department
More informationSpring Prof. Hyesoon Kim
Spring 2011 Prof. Hyesoon Kim 2 Warp is the basic unit of execution A group of threads (e.g. 32 threads for the Tesla GPU architecture) Warp Execution Inst 1 Inst 2 Inst 3 Sources ready T T T T One warp
More informationParallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs
Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs C.-C. Su a, C.-W. Hsieh b, M. R. Smith b, M. C. Jermy c and J.-S. Wu a a Department of Mechanical Engineering, National Chiao Tung
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationOptimizing Data Locality for Iterative Matrix Solvers on CUDA
Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationHeterogeneous platforms
Heterogeneous platforms Systems combining main processors and accelerators e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC Any platform using a GPU is a heterogeneous platform! Further in this talk
More informationIs Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?
Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors? Yunlian Jiang Eddy Z. Zhang Kai Tian Xipeng Shen (presenter) Department of Computer Science, VA, USA Cache Sharing A common
More informationPeta-Scale Simulations with the HPC Software Framework walberla:
Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method SIAM PP 2016, Paris April 15, 2016 Florian Schornbaum, Christian Godenschwager,
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationFast Segmented Sort on GPUs
Fast Segmented Sort on GPUs Kaixi Hou, Weifeng Liu, Hao Wang, Wu-chun Feng {kaixihou, hwang121, wfeng}@vt.edu weifeng.liu@nbi.ku.dk Segmented Sort (SegSort) Perform a segment-by-segment sort on a given
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationExploring GPU Architecture for N2P Image Processing Algorithms
Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly
More informationTHE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS
Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationA MULTI-GPU COMPUTE SOLUTION FOR OPTIMIZED GENOMIC SELECTION ANALYSIS. A Thesis. presented to. the Faculty of California Polytechnic State University
A MULTI-GPU COMPUTE SOLUTION FOR OPTIMIZED GENOMIC SELECTION ANALYSIS A Thesis presented to the Faculty of California Polytechnic State University San Luis Obispo In Partial Fulfillment of the Requirements
More informationA Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps
A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationCUB. collective software primitives. Duane Merrill. NVIDIA Research
CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives
More informationIntroduction to GPU computing
Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU
More informationParallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010
Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationDesign of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017
Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures
More informationWeb Physics: A Hardware Accelerated Physics Engine for Web- Based Applications
Web Physics: A Hardware Accelerated Physics Engine for Web- Based Applications Tasneem Brutch, Bo Li, Guodong Rong, Yi Shen, Chang Shu Samsung Research America-Silicon Valley {t.brutch, robert.li, g.rong,
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationPerformance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX
Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX Wen-mei Hwu with David Kirk, Shane Ryoo, Christopher Rodrigues, John Stratton, Kuangwei Huang Overview
More informationPorting a parallel rotor wake simulation to GPGPU accelerators using OpenACC
DLR.de Chart 1 Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC Melven Röhrig-Zöllner DLR, Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU)
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationCUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD
More informationBreadth-first Search in CUDA
Assignment 3, Problem 3, Practical Parallel Computing Course Breadth-first Search in CUDA Matthias Springer Tokyo Institute of Technology matthias.springer@acm.org Abstract This work presents a number
More informationParallel Implementation of VLSI Gate Placement in CUDA
ME 759: Project Report Parallel Implementation of VLSI Gate Placement in CUDA Movers and Placers Kai Zhao Snehal Mhatre December 21, 2015 1 Table of Contents 1. Introduction...... 3 2. Problem Formulation...
More informationOptimizing Parallel Reduction in CUDA
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf Parallel Reduction Tree-based approach used within each
More informationA Study on Optimally Co-scheduling Jobs of Different Lengths on CMP
A Study on Optimally Co-scheduling Jobs of Different Lengths on CMP Kai Tian Kai Tian, Yunlian Jiang and Xipeng Shen Computer Science Department, College of William and Mary, Virginia, USA 5/18/2009 Cache
More information14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs
14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs K. Esler, D. Dembeck, K. Mukundakrishnan, V. Natoli, J. Shumway and Y. Zhang Stone Ridge Technology, Bel Air, MD
More informationState of Art and Project Proposals Intensive Computation
State of Art and Project Proposals Intensive Computation Annalisa Massini - 2015/2016 Today s lecture Project proposals on the following topics: Sparse Matrix- Vector Multiplication Tridiagonal Solvers
More informationCS377P Programming for Performance GPU Programming - II
CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline
More informationOptimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA
Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei H. Hwu
More information