Maximizing Face Detection Performance
|
|
- Reginald Henry
- 5 years ago
- Views:
Transcription
1 Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC
2 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount of work Improving cache behavior Note on feature format The points made apply to any cascaded classifier Face detection is just one example 2
3 Quick Review Slide a window around the image Use weak classifiers to detect object presence in each position I ll call a position a candidate Think of all the (x,y) positions that could be upper-left corners of a candidate window Each candidate is independent of all others -> easy opportunity to parallelize Cascade of weak-classifiers per candidate Some number of stages are cascaded Decision to continue/abort is made after each stage Each stage contains a number of weak-classifiers Evaluate some feature on the window, add its result to the running-stage sum Do this at multiple scales Classifiers are trained on small windows (~20x20 pixels) To detect objects of different sizes, do one of: Adjust the size of candidate windows (and scale features) Adjust (scale) image to match training window-size Group the candidates that passed the entire cascade 3
4 Input Image 4
5 Candidates that Pass All Stages 5
6 Candidates After Grouping 6
7 OpenCV haarcascade_frontalface_alt2.xml 20 stages 1047 weak-classifiers 2094 Haar-like features Each weak classifier is a 2- feature tree 4535 rectangles 1747 features contain 2 rects 347 features have 3 rects Idea is to reject more and more negatives with successive stages, passing through the positives Earlier stages are simpler for perf reasons Quickly reject negatives, reducing work for subsequent stages False-positives are OK, false-negatives are not OK 7
8 MBLBP Classifier 16 stages 451 features 4059 rects 419 unique features 8
9 Parallelization Ample opportunity for parallelization Scales are independent of each other Each scale has a (large) number of candidates, all independent A number of choices to be made: Number of threads per candidate window One or multiple threads per candidate Cascade stage processing All stages in a single or multiple kernel launches Scale processing In sequence (single stream) or concurrent (multiple streams) 9
10 Parallelization Ample opportunity for parallelization Scales are independent of each other Each scale has a (large) number of candidates, all independent A number of choices to be made: Number of threads per candidate window One or The multiple combination threads per of candidate choices can be Cascade stage overwhelming, processing so it helps to get some All stages intuition a single for or the multiple algorithm kernel launches operation Scale processing In sequence (single stream) or concurrent (multiple streams) 10
11 Input Image 11
12 Lighter = Candidate Passed More Stages 12
13 Lighter = Candidate Passed More Stages 13
14 Candidates Passing Stages 1920x1080 input image 5 scales: pixel faces 1.25x scaling factor Process each candidate Start with 478K candidates 254 pass all stages 14
15 Observations Adjacent candidates can pass very different number of stages Different amount of work for adjacent candidates The amount of candidates remaining decreases with the number of stages Often each stage rejects ~50% of candidates Depends on training parameters, etc. 15
16 Parallelization Choices 16
17 Chosen Parallelization One thread per candidate A thread iterates through the stages, deciding whether to continue after each stage Loop through the weak-classifiers for each stage Simple port: kernel code nearly identical to CPU code CPU-only code iterates through the candidates ( slides the window ) GPU code launches a thread for each candidate GPU kernel code = CPU loop body Two challenges: Different workloads per candidate (thus per thread) Having enough work to saturate a GPU 17
18 Challenge: Different Workloads GPU execution refresher: Threads are grouped into threadblocks Resources (thread IDs, registers, SMEM) are released only when all the threads in a block terminate Instructions are executed per warp (SIMT) 32 consecutive threads issue the same instruction Different code paths are allowed, threads get masked out during the path they don t take What these mean to cascades: If at least one thread in a warp needs to evaluate a stage, all 32 threads take the same time Inactive threads waste math pipelines If at least one thread in a threadblock needs to continue evaluating, the resources of all other threads in that block are not released Prevent new threads from starting right away 18
19 Instructions, time Challenge: Different Workloads GPU execution refresher: Threads are grouped into threadblocks Resources (thread IDs, registers, SMEM) are released only when all the threads in a block terminate Instructions are executed per warp (SIMT) 32 consecutive threads issue the same instruction Different code paths are allowed, threads get masked out during the path they don t take What these mean to cascades: If at least one thread in a warp needs to evaluate a stage, all 32 threads take the same time Inactive threads waste math pipelines If at least one thread in a threadblock needs to continue evaluating, the resources of all other threads in that block are not released Prevent new threads from starting right away 0 19
20 Challenge: Different Workloads GPU execution refresher: Threads are grouped into threadblocks Resources (thread IDs, registers, SMEM) are released only when all the threads in a block terminate Instructions are executed per warp (SIMT) 32 consecutive threads issue the same instruction Different code paths are allowed, threads get masked out during the path they don t take What these mean to cascades: If at least one thread in a warp needs to evaluate a stage, all 32 threads go through evaluation instructions Inactive threads waste math pipelines If at least one thread in a threadblock needs to continue evaluating, the resources of all other threads in that block are not released Prevent new threads from starting right away 20
21 Stage Processing Threads decide whether to terminate after each stage Could process all stages with a single kernel launch Potentially wasting the math and resources Could break stages into segments (work compaction ) A sequence of kernel launches, one per segment Maintain a work-queue Launch only as many threads as there are candidates in the queue At the end of each segment append the live candidates to the queue Use atomics for updating the index Work-queue maintenance adds some overhead Read/write queues (writes are atomic) Communicate queue size to CPU for subsequent launch 21
22 Stage Processing: Timing Results 20-stage classifier, TK1 1 segment: 127 ms (1-20 stages) 2 segments: 93 ms (1-3, 4-20 stages) 3 segments: 84 ms (1-3, 4-7, 8-20 stages) 16-stage classifier: 1 segment: 134 ms 2 segments: 126 ms (1-2, 3-16 stages) K40: 9.8 ms, 8.7 ms 22
23 Why I Didn t Choose SMEM Here SMEM could be used to store the integral image tile needed by a threadblock, but: SMEM makes scaling features impractical SMEM overhead becomes prohibitive, forcing us to scale images SMEM precludes work-compaction: A threadblock must cover a contiguous region to read all the inputs Preliminary test with another classifier showed very small difference between using SMEM or just reading via texture cache And the texture code was still scaling image (could have been avoided) Can use either texture functions, or ldg() with regular pointers Caution: the evidence isn t conclusive yet Classifiers that benefit little from compaction may benefit from SMEM 23
24 Why I Didn t Choose SMEM Here SMEM could be used to store the integral image tile needed by a threadblock, but: SMEM makes scaling features impractical SMEM overhead becomes prohibitive, forcing us to scale images SMEM precludes work-compaction: A threadblock must cover a contiguous region to read all the inputs Preliminary test with another classifier showed very small difference between using SMEM or just reading via texture cache And the texture code was still scaling image (could have been avoided) Can use either texture functions, or ldg() with regular pointers Caution: the evidence isn t conclusive yet Classifiers that benefit little from compaction may benefit from SMEM 24
25 Why I Didn t Choose SMEM Here SMEM could be used to store the integral image tile needed by a threadblock, but: SMEM makes scaling features impractical SMEM overhead becomes prohibitive, forcing us to scale images SMEM precludes work-compaction: A threadblock must cover a contiguous region to read all the inputs Preliminary test with another classifier showed very small difference between using SMEM or just reading via texture cache And the texture code was still scaling image (could have been avoided) Can use either texture functions, or ldg() with regular pointers Caution: the evidence isn t conclusive yet Classifiers that benefit little from compaction may benefit from SMEM 25
26 Challenge: Enough Work to Saturate a GPU We start out with 100s of thousands of candidates Plenty to saturate even the biggest GPUs We are left with fewer and fewer candidates as stages reject them Even 1-SM GPUs (TK1) will start idling Bigger GPUs will start idling sooner Two solutions: Process scales concurrently Switch parallelization after some number of stages 26
27 Challenge: Enough Work to Saturate a GPU We start out with 100s of thousands of candidates Plenty to saturate even the biggest GPUs We are left with fewer and fewer candidates as stages reject them Even 1-SM GPUs (TK1) will start idling Bigger GPUs will start idling sooner Two solutions: Process scales concurrently Switch parallelization after some number of stages 27
28 Challenge: Enough Work to Saturate a GPU We start out with 100s of thousands of candidates Plenty to saturate even the biggest GPUs We are left with fewer and fewer candidates as stages reject them Even 1-SM GPUs (TK1) will start idling Bigger GPUs will start idling sooner Two solutions: Process scales concurrently Switch parallelization after some number of stages 28
29 Concurrent Scale Processing Issue kernels for different scales into different streams Scales are independent Maintain a different work-queue for each scale So that features can be properly scaled Orthogonal to work-compaction: Loop through the segments For each segment launch as many kernels as you have scales GPU stream support in hw: TK1 supports 4 streams Other GPUs (Kepler and more recent) support 32 streams More streams can be used in sw, but will result in stream aliasing 29
30 TK1: 16-stage MBLBP Classifier Sequential Scale-0 Scale-1 Scale-2 Scale-3 Scale-4 Concurrent Concurrent 2 segments 30
31 K40: 16-stage MBLBP Classifier 9.7 ms 8.3 ms 2-segments 6.1 ms 31
32 TK1: 78.9 ms 20-stage Haar-like 5 scales of stages scales of stages 4-7 stages 8-20 K40: 7.2 ms 32
33 Switching Parallelizations One thread per candidate: Pro: candidates go through minimal stage count Con: GPU becomes latency limited After a number of stages there isn t enough work to hide latency Very rough rule of thumb: fewer than 512 threads per SM GPU becomes underutilized Alternative parallelization: Use multiple threads per candidate, say a warp A warp evaluates 32 features in parallel Performs a reduction (or prefix sum) to compute stage sum Power-of-2 up to a warp is nice because of the shfl/vote instructions May do unnecessary work A thread evaluates a feature it wouldn t have reached sequentially 33
34 Switching Parallelizations Idea: change parallelization when only a few 100 candidates remain Prior to that continue to use 1 thread/candidate Avoids inter-thread communication and unnecessary work Preliminary work on a different classifier: A few 100 features Speedup: K40: x (depending on image) TK1: 1.0x Results suggest that: Alternative parallelization helps when you have lots of stages with too few candidates to saturate the GPU Confirmed when TK1 ran a classifier with even more stages 34
35 Work Reduction 35
36 Reduce the Initial Number of Candidates Less work -> less time Will reach the point of non-saturated GPU sooner Makes concurrent scale processing even more useful Two ways to reduce the initial candidate count: Use a mask to not consider some candidates ROI, skin-tone, etc. Skip candidates (stride > 1) Post-process neighborhoods of rectangles that didn t get grouped 36
37 Skin Tone Mask Race invariant, simply needs a white-balanced camera Color density plots for asian, african, and caucasian skin from 37
38 Candidate Mask Mask pixel at (x,y) corresponds to upper-left corner of a candidate window Shown for scale-0 (58-pixel face) A candidate window is masked out (black) if fewer than 50% of its pixels were not skin-toned 76% of candidates were rejected at this scale 38
39 More Candidate Masks 39
40 Skin-tone Masking A bit of extra work: Classify each input pixel as skin-toned or not 5-10 math instructions in RGB or YUV Can be done in the same kernel as RGB->luminance conversion Compute integral image of pixel classes Use the integral image to reject candidates when creating the initial work-queue for detection Experimental data: TK1: No mask, no streams, no segments: ms Mask, no streams, no segments: 34.9 ms (~4x speedup, as expected) Mask, streams, no segments: 34.4 ms Mask, streams, segments: 34.0 ms K40: No mask, no streams, no segments: 9.8 ms Mask, no streams, no segments: 4.3 ms Mask, streams, no segments: 2.8 ms Mask, streams, segments: 2.1 ms (~2.3x speedup -> less than 4x expected indicates GPU is idling) 40
41 Improving Cache Behavior 41
42 Improving Cache Behavior Till now the integral image was 1921x1081 First scale (scale-0) is 2.44x: Training window is 24 pixels Smallest face of interest is 50 pixels, scaling factor is 1.25x Implies that a 787x443 integral image is sufficient ~6x smaller than original size Smaller image footprint can improve cache behavior In this case it s the read-only (aka read-only ) cache on the SM Reduces requests to L2 Lower latency Less bandwidth-pressure higher L2 hit-rate -> less traffic to DRAM 42
43 Empirical Data 16-stage MBLBP classifier 2 segments, concurrent scale processing TK1: Mask: 2.12x speedup ( 34 ms -> 16 ms) No mask: 2.33x speedup (126 ms -> 54 ms) K40: Mask: 1.27x speedup (2.1 ms -> 1.7 ms) No mask: 1.56x speedup (7.5 ms -> 4.8 ms) 43
44 Benefits of Downscaling Reduced requests to L2 by ~3x on both GPUs TK1 was being limited by L2 bandwidth: Before downscaling: 40-93% of L2 theory After downscaling: 28-74% K40 was sensitive to L2 bandwidth: Before downscaling: 12-70% of theory After downscaling: 5-35% K40 has 1.6x more L2 bandwidth/sm than TK1 Thus less sensitive to bandwidth for this application than TK1 Improved L2 hit-rate (lowered traffic to DRAM) TK1: from 5-55% to 54-98% K40: from 44-99% to 98-99% K40 has 12x more L2 than TK1 Thus able to achieve a higher hit-rate than TK1, reducing traffic to DRAM 44
45 Benefits of Downscaling Reduced requests to L2 by ~3x on both GPUs TK1 was being limited by L2 bandwidth: Before downscaling: 40-93% of L2 theory After downscaling: 28-74% K40 was sensitive to L2 bandwidth: Before downscaling: 12-70% of theory After downscaling: 5-35% K40 has 1.6x more L2 bandwidth/sm than TK1 Thus less sensitive to bandwidth for this application than TK1 Improved L2 hit-rate (lowered traffic to DRAM) TK1: from 5-55% to 54-98% K40: from 44-99% to 98-99% K40 has 12x more L2 than TK1 Thus able to achieve a higher hit-rate than TK1, reducing traffic to DRAM 45
46 Quick Summary We ve examined several ways to improve performance Breaking stages into segments: up to 1.3x Concurrent processing of scales: 1.2-2x Can be higher, depending on classifier and GPU Downscaling to the first scale first: x Masking (ROI): ~3x Depends on content and masking approach All of the above use the same exact kernel code Adjust only image or launch parameters Together improved cascade time: TK1: from 134 ms to 16 ms (8.4x speedup) K40: from 9.8 ms to 1.7 ms (5.8x speedup) Switching parallelization after a number of stages Potential further speedup of ~1.5x 46
47 Note on Feature Format 47
48 Feature Storage Format Many features are rectangle based Two approaches to storing features in memory: Geometry: coordinates/sizes within a window Pointers: Popular in OpenCV and other codes Compute pointers to the vertices of window 0 Window-0: the first window (top-left corner, for example) Vertices for window k are addressed by adding offset k to these pointers Pointer math per vertex: 64-bit multiply-add» A dependent sequence of 2+ instructions on GPU 48
49 MB-LBP Features Only one pattern: 3x3 tile of rectangles Pointers: Need 16 pointers: 128 B per feature 32 or more address instructions per window Geometry: 4 values: (x,y) of top-left corner, width, height 16 bytes per feature when storing ints could be as low as 4B when storing chars, but would require bitextraction instructions Address math: ~50 instructions 49
50 Haar-like Features 5 fundamental patterns (2 or 3 rectangles) Pointers: 6, 8, or 9 pointers: bytes per feature or more instructions per window Geometry: Several choices: Store each rectangle (2 or 3 per feature) Store vertices (would need 5 categories) When storing each rectangle 4 values: (x,y) of top-left corner, width, height All 4 values are relative to training window Usually 20x20 to 32x32 in size So, could store as few 4B (4 chars), 16 B if storing ints» 4 chars would require bit-extraction instructions 3x16B = 48 B per feature ~3x16 = 48 instructions per window 50
51 Pointers vs Geometry When processing multiple scales: Geometry places no requirements when processing multiple scales Pointers require one of: Compute pointers for each scale Scale image and compute integral for each scale Pointers also require one of: Additional buffer for the integral image Buffer to be reused by all images Compute pointers for each input image 51
52 MBLBP Performance Geometry was 3.5x faster than pointers Quick test: no segments, no streams, no mask All other numbers in this presentation were measured with geometry 52
53 Feature Multiples Sometimes the same feature is used in several stages Two choices: Have multiple copies of the feature in memory Simple array traversal Consumes more memory Add a level of indirection: Each feature is stored exactly once Maintain an array of indices Map weak-classifiers to unique features Approach implemented in OpenCV Preference for performance: store multiple copies, avoid indirection Indirection adds 100s-1000s cycles of latency, adds to bandwidth pressure as well Read the index from memory Use the index to read feature from memory Typically only a very small percentage of features are replicated Negligible impact on memory consumed 53
54 Summary Cascade performance for a 16-stage MBLBP classifier: TK1: 16.0 ms K40: 1.6 ms Can likely be improved further (these are without switchedparallelization) We looked at: How to parallelize cascaded classifiers How to reduce input to a cascade How to maximize cache performance for cascades How to store features in memory Performance impact of the above: Varies with classifier, detection parameters and GPU Good choices can lead to O(10) speedup over the naïve approach 54
CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer
CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs
More informationKernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow
Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization
More informationFundamental Optimizations
Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access
More informationSupercomputing, Tutorial S03 New Orleans, Nov 14, 2010
Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access
More informationIdentifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011
Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationCUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationPerformance Optimization: Programming Guidelines and GPU Architecture Reasons Behind Them
Performance Optimization: Programming Guidelines and GPU Architecture Reasons Behind Them Paulius Micikevicius Developer Technology, NVIDIA Goals of this Talk Two-fold: Describe how hardware operates Show
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA
CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose Joe Stam Optimization GPUs are very fast BUT Poor programming can lead to disappointing performance Squeaking out the most speed
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION
CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationAdvanced CUDA Optimizations. Umar Arshad ArrayFire
Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers
More informationMattan Erez. The University of Texas at Austin
EE382V (17325): Principles in Computer Architecture Parallelism and Locality Fall 2007 Lecture 12 GPU Architecture (NVIDIA G80) Mattan Erez The University of Texas at Austin Outline 3D graphics recap and
More informationCUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN
CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school
More informationCUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. Stephen Jones, GTC 2017
CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES Stephen Jones, GTC 2017 The art of doing more with less 2 Performance RULE #1: DON T TRY TOO HARD Peak Performance Time 3 Unrealistic Effort/Reward Performance
More informationAdvanced CUDA Programming. Dr. Timo Stich
Advanced CUDA Programming Dr. Timo Stich (tstich@nvidia.com) Outline SIMT Architecture, Warps Kernel optimizations Global memory throughput Launch configuration Shared memory access Instruction throughput
More informationS WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018
S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific
More informationCS/EE 217 Midterm. Question Possible Points Points Scored Total 100
CS/EE 217 Midterm ANSWER ALL QUESTIONS TIME ALLOWED 60 MINUTES Question Possible Points Points Scored 1 24 2 32 3 20 4 24 Total 100 Question 1] [24 Points] Given a GPGPU with 14 streaming multiprocessor
More informationCUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012
CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix
More informationAddressing the Memory Wall
Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the
More informationXIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture
XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics
More informationOptimizing DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing DirectX Graphics Richard Huddy European Developer Relations Manager Some early observations Bear in mind that graphics performance problems are both commoner and rarer than you d think The most
More informationGPU-accelerated data expansion for the Marching Cubes algorithm
GPU-accelerated data expansion for the Marching Cubes algorithm San Jose (CA) September 23rd, 2010 Christopher Dyken, SINTEF Norway Gernot Ziegler, NVIDIA UK Agenda Motivation & Background Data Compaction
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION
April 4-7, 2016 Silicon Valley CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION CHRISTOPH ANGERER, NVIDIA JAKOB PROGSCH, NVIDIA 1 WHAT YOU WILL LEARN An iterative method to optimize your GPU
More informationNVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010
Analysis-Driven Optimization Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Performance Optimization Process Use appropriate performance metric for each kernel For example,
More informationParallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)
Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Analyzing a 3D Graphics Workload Where is most of the work done? Memory Vertex
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationPerformance Optimization Process
Analysis-Driven Optimization (GTC 2010) Paulius Micikevicius NVIDIA Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for a bandwidth-bound
More informationCS377P Programming for Performance GPU Programming - II
CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationShadows for Many Lights sounds like it might mean something, but In fact it can mean very different things, that require very different solutions.
1 2 Shadows for Many Lights sounds like it might mean something, but In fact it can mean very different things, that require very different solutions. 3 We aim for something like the numbers of lights
More informationLecture 2: CUDA Programming
CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:
More informationUse cases. Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games
Viewdle Inc. 1 Use cases Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games 2 Why OpenCL matter? OpenCL is going to bring such
More informationParallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)
Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Today Finishing up from last time Brief discussion of graphics workload metrics
More informationAdvanced Database Systems
Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed
More informationPlot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0;
How will execution time grow with SIZE? int array[size]; int A = ; for (int i = ; i < ; i++) { for (int j = ; j < SIZE ; j++) { A += array[j]; } TIME } Plot SIZE Actual Data 45 4 5 5 Series 5 5 4 6 8 Memory
More informationOptimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as
More informationReal-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis
Real-Time Reyes: Programmable Pipelines and Research Challenges Anjul Patney University of California, Davis Real-Time Reyes-Style Adaptive Surface Subdivision Anjul Patney and John D. Owens SIGGRAPH Asia
More informationFundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA
Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU
More informationOptimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as
More informationOptimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing for DirectX Graphics Richard Huddy European Developer Relations Manager Also on today from ATI... Start & End Time: 12:00pm 1:00pm Title: Precomputed Radiance Transfer and Spherical Harmonic
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationGeForce4. John Montrym Henry Moreton
GeForce4 John Montrym Henry Moreton 1 Architectural Drivers Programmability Parallelism Memory bandwidth 2 Recent History: GeForce 1&2 First integrated geometry engine & 4 pixels/clk Fixed-function transform,
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationBenchmarking the Memory Hierarchy of Modern GPUs
1 of 30 Benchmarking the Memory Hierarchy of Modern GPUs In 11th IFIP International Conference on Network and Parallel Computing Xinxin Mei, Kaiyong Zhao, Chengjian Liu, Xiaowen Chu CS Department, Hong
More informationCode Optimizations for High Performance GPU Computing
Code Optimizations for High Performance GPU Computing Yi Yang and Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University 1 Question to answer Given a task to accelerate
More informationScalable GPU Graph Traversal!
Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang
More informationSudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread
Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationProfiling & Tuning Applications. CUDA Course István Reguly
Profiling & Tuning Applications CUDA Course István Reguly Introduction Why is my application running slow? Work it out on paper Instrument code Profile it NVIDIA Visual Profiler Works with CUDA, needs
More informationScheduling Image Processing Pipelines
Lecture 15: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];
More informationSpring Prof. Hyesoon Kim
Spring 2011 Prof. Hyesoon Kim 2 Warp is the basic unit of execution A group of threads (e.g. 32 threads for the Tesla GPU architecture) Warp Execution Inst 1 Inst 2 Inst 3 Sources ready T T T T One warp
More informationReductions and Low-Level Performance Considerations CME343 / ME May David Tarjan NVIDIA Research
Reductions and Low-Level Performance Considerations CME343 / ME339 27 May 2011 David Tarjan [dtarjan@nvidia.com] NVIDIA Research REDUCTIONS Reduction! Reduce vector to a single value! Via an associative
More informationHybrid Implementation of 3D Kirchhoff Migration
Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation
More informationPowerVR Hardware. Architecture Overview for Developers
Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.
More informationCS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck
Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find
More informationX10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management
X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large
More informationOutline. Single GPU Implementation. Multi-GPU Implementation. 2-pass and 1-pass approaches Performance evaluation. Scalability on clusters
Implementing 3D Finite Difference Codes on the GPU Paulius Micikevicius NVIDIA Outline Single GPU Implementation 2-pass and 1-pass approaches Performance evaluation Multi-GPU Implementation Scalability
More informationStreaming Massive Environments From Zero to 200MPH
FORZA MOTORSPORT From Zero to 200MPH Chris Tector (Software Architect Turn 10 Studios) Turn 10 Internal studio at Microsoft Game Studios - we make Forza Motorsport Around 70 full time staff 2 Why am I
More informationGPU Accelerated Application Performance Tuning. Delivered by John Ashley, Slides by Cliff Woolley, NVIDIA Developer Technology Group
GPU Accelerated Application Performance Tuning Delivered by John Ashley, Slides by Cliff Woolley, NVIDIA Developer Technology Group Main Requirements for GPU Performance Expose sufficient parallelism Use
More informationAbout Phoenix FD PLUGIN FOR 3DS MAX AND MAYA. SIMULATING AND RENDERING BOTH LIQUIDS AND FIRE/SMOKE. USED IN MOVIES, GAMES AND COMMERCIALS.
About Phoenix FD PLUGIN FOR 3DS MAX AND MAYA. SIMULATING AND RENDERING BOTH LIQUIDS AND FIRE/SMOKE. USED IN MOVIES, GAMES AND COMMERCIALS. Phoenix FD core SIMULATION & RENDERING. SIMULATION CORE - GRID-BASED
More informationDesign guidelines for embedded real time face detection application
Design guidelines for embedded real time face detection application White paper for Embedded Vision Alliance By Eldad Melamed Much like the human visual system, embedded computer vision systems perform
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationHIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1
April 4-7, 2016 Silicon Valley HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1 Max Lv, NVIDIA Brant Zhao, NVIDIA April 7 mlv@nvidia.com https://github.com/madeye Histogram of Oriented Gradients on GPU
More informationScheduling Image Processing Pipelines
Lecture 14: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];
More informationIntroduction. Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: We study the bandwidth problem
Introduction Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: Increase computation power Make the best use of available bandwidth We study the bandwidth
More informationProfiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency
Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationRegister File Organization
Register File Organization Sudhakar Yalamanchili unless otherwise noted (1) To understand the organization of large register files used in GPUs Objective Identify the performance bottlenecks and opportunities
More informationHere s the general problem we want to solve efficiently: Given a light and a set of pixels in view space, resolve occlusion between each pixel and
1 Here s the general problem we want to solve efficiently: Given a light and a set of pixels in view space, resolve occlusion between each pixel and the light. 2 To visualize this problem, consider the
More informationAgenda. EE 260: Introduction to Digital Design Memory. Naive Register File. Agenda. Memory Arrays: SRAM. Memory Arrays: Register File
EE 260: Introduction to Digital Design Technology Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa 2 Technology Naive Register File Write Read clk Decoder Read Write 3 4 Arrays:
More informationHardware-driven Visibility Culling Jeong Hyun Kim
Hardware-driven Visibility Culling Jeong Hyun Kim KAIST (Korea Advanced Institute of Science and Technology) Contents Introduction Background Clipping Culling Z-max (Z-min) Filter Programmable culling
More informationOperating Systems Design Exam 2 Review: Spring 2011
Operating Systems Design Exam 2 Review: Spring 2011 Paul Krzyzanowski pxk@cs.rutgers.edu 1 Question 1 CPU utilization tends to be lower when: a. There are more processes in memory. b. There are fewer processes
More informationMemory Hierarchies &
Memory Hierarchies & Cache Memory CSE 410, Spring 2009 Computer Systems http://www.cs.washington.edu/410 4/26/2009 cse410-13-cache 2006-09 Perkins, DW Johnson and University of Washington 1 Reading and
More informationGUERRILLA DEVELOP CONFERENCE JULY 07 BRIGHTON
Deferred Rendering in Killzone 2 Michal Valient Senior Programmer, Guerrilla Talk Outline Forward & Deferred Rendering Overview G-Buffer Layout Shader Creation Deferred Rendering in Detail Rendering Passes
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationOptimizing CUDA for GPU Architecture. CSInParallel Project
Optimizing CUDA for GPU Architecture CSInParallel Project August 13, 2014 CONTENTS 1 CUDA Architecture 2 1.1 Physical Architecture........................................... 2 1.2 Virtual Architecture...........................................
More informationLecture 16. Today: Start looking into memory hierarchy Cache$! Yay!
Lecture 16 Today: Start looking into memory hierarchy Cache$! Yay! Note: There are no slides labeled Lecture 15. Nothing omitted, just that the numbering got out of sequence somewhere along the way. 1
More informationCS 416: Opera-ng Systems Design March 23, 2012
Question 1 Operating Systems Design Exam 2 Review: Spring 2011 Paul Krzyzanowski pxk@cs.rutgers.edu CPU utilization tends to be lower when: a. There are more processes in memory. b. There are fewer processes
More informationComputer Architecture Spring 2016
omputer Architecture Spring 2016 Lecture 09: Prefetching Shuai Wang Department of omputer Science and Technology Nanjing University Prefetching(1/3) Fetch block ahead of demand Target compulsory, capacity,
More informationCS 220: Introduction to Parallel Computing. Introduction to CUDA. Lecture 28
CS 220: Introduction to Parallel Computing Introduction to CUDA Lecture 28 Today s Schedule Project 4 Read-Write Locks Introduction to CUDA 5/2/18 CS 220: Parallel Computing 2 Today s Schedule Project
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationIntroduction to Multicore architecture. Tao Zhang Oct. 21, 2010
Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)
More informationPersistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL
(stashing recurrent weights on-chip) Baidu SVAIL April 7, 2016 SVAIL Think hard AI. Goal Develop hard AI technologies that impact 100 million users. Deep Learning at SVAIL 100 GFLOP/s 1 laptop 6 TFLOP/s
More informationCS 677: Parallel Programming for Many-core Processors Lecture 6
1 CS 677: Parallel Programming for Many-core Processors Lecture 6 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Logistics Midterm: March 11
More informationOutline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis
Memory Optimization Outline Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis Memory Hierarchy 1-2 ns Registers 32 512 B 3-10 ns 8-30 ns 60-250 ns 5-20
More informationECE 498AL. Lecture 12: Application Lessons. When the tires hit the road
ECE 498AL Lecture 12: Application Lessons When the tires hit the road 1 Objective Putting the CUDA performance knowledge to work Plausible strategies may or may not lead to performance enhancement Different
More informationAnalysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth
Analysis Report v3 Duration 932.612 µs Grid Size [ 1024,1,1 ] Block Size [ 1024,1,1 ] Registers/Thread 32 Shared Memory/Block 28 KiB Shared Memory Requested 64 KiB Shared Memory Executed 64 KiB Shared
More informationRecall: Address Space Map. 13: Memory Management. Let s be reasonable. Processes Address Space. Send it to disk. Freeing up System Memory
Recall: Address Space Map 13: Memory Management Biggest Virtual Address Stack (Space for local variables etc. For each nested procedure call) Sometimes Reserved for OS Stack Pointer Last Modified: 6/21/2004
More informationControl Hazards. Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More information