Maximizing Face Detection Performance

1 Maximizing Face Detection Performance
Paulius Micikevicius, Developer Technology Engineer, NVIDIA, GTC

2 Outline
- Very brief review of cascaded classifiers
- Parallelization choices
- Reducing the amount of work
- Improving cache behavior
- Note on feature format
The points made apply to any cascaded classifier; face detection is just one example.

3 Quick Review
- Slide a window around the image and use weak classifiers to detect object presence at each position. I'll call each position a candidate: think of all the (x,y) positions that could be upper-left corners of a candidate window. Each candidate is independent of all others, an easy opportunity to parallelize
- A cascade of weak classifiers runs per candidate: some number of stages are cascaded, and the decision to continue/abort is made after each stage. Each stage contains a number of weak classifiers: evaluate some feature on the window and add its result to the running stage sum
- Do this at multiple scales. Classifiers are trained on small windows (~20x20 pixels), so to detect objects of different sizes, either adjust the size of the candidate windows (and scale the features) or scale the image to match the training window size
- Group the candidates that passed the entire cascade (a sketch of the per-candidate loop follows)
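For concreteness, here is a minimal CPU-side sketch of the per-candidate cascade loop just described. The Cascade layout, evalFeature(), and the thresholds are hypothetical stand-ins for illustration, not OpenCV's actual data structures:

    struct Cascade {
        int numStages;
        const int* stageBegin;        // first weak-classifier index of each stage
        const int* stageEnd;          // one past the last index of each stage
        const float* stageThreshold;  // stage passes if the running sum reaches this
    };

    // Evaluate one weak classifier's feature on the window at (x, y); defined elsewhere.
    float evalFeature(int featureIdx, int x, int y);

    // Returns true if the candidate window at (x, y) survives all stages.
    bool passesCascade(int x, int y, const Cascade& c)
    {
        for (int s = 0; s < c.numStages; ++s) {
            float sum = 0.0f;
            for (int f = c.stageBegin[s]; f < c.stageEnd[s]; ++f)
                sum += evalFeature(f, x, y);      // add feature result to the stage sum
            if (sum < c.stageThreshold[s])
                return false;                     // reject: abort the cascade at this stage
        }
        return true;                              // candidate passed the entire cascade
    }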

4 Input Image

5 Candidates that Pass All Stages

6 Candidates After Grouping

7 OpenCV haarcascade_frontalface_alt2.xml
- 20 stages, 1047 weak classifiers, 2094 Haar-like features; each weak classifier is a 2-feature tree
- 4535 rectangles: 1747 features contain 2 rects, 347 features have 3 rects
- The idea is to reject more and more negatives with successive stages while passing through the positives. Earlier stages are simpler for performance reasons: they quickly reject negatives, reducing the work for subsequent stages. False positives are OK; false negatives are not

8 MBLBP Classifier
- 16 stages, 451 features, 4059 rects, 419 unique features

9 Parallelization
Ample opportunity for parallelization: scales are independent of each other, and each scale has a (large) number of candidates, all independent.
A number of choices to be made:
- Threads per candidate window: one or multiple threads per candidate
- Cascade stage processing: all stages in a single kernel launch, or multiple launches
- Scale processing: in sequence (single stream) or concurrently (multiple streams)

10 Parallelization (continued)
The combination of choices can be overwhelming, so it helps to get some intuition for the algorithm's operation.

11 Input Image

12 Lighter = Candidate Passed More Stages

13 Lighter = Candidate Passed More Stages

14 Candidates Passing Stages
- 1920x1080 input image
- 5 scales of face sizes, 1.25x scaling factor
- Process each candidate: start with 478K candidates; 254 pass all stages

15 Observations
- Adjacent candidates can pass very different numbers of stages, i.e. different amounts of work for adjacent candidates
- The number of candidates remaining decreases with the number of stages: often each stage rejects ~50% of candidates (depends on training parameters, etc.)

16 Parallelization Choices

17 Chosen Parallelization
- One thread per candidate: the thread iterates through the stages, deciding whether to continue after each stage, and loops through the weak classifiers of each stage
- Simple port: the kernel code is nearly identical to the CPU code. CPU-only code iterates through the candidates ("slides the window"); the GPU code launches a thread for each candidate, so the GPU kernel code = the CPU loop body (a sketch follows)
- Two challenges: different workloads per candidate (and thus per thread), and having enough work to saturate a GPU
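A minimal sketch of that port, assuming the passesCascade() loop above is compiled as a __device__ function; the grid/block shape and the output array are illustrative choices, not the talk's exact code:

    __device__ bool passesCascadeDevice(int x, int y);  // same loop body as the CPU version

    // One thread per candidate: thread (x, y) evaluates the window whose
    // upper-left corner is pixel (x, y).
    __global__ void detectKernel(int numCandX, int numCandY, unsigned char* passed)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= numCandX || y >= numCandY)
            return;
        passed[y * numCandX + x] = passesCascadeDevice(x, y) ? 1 : 0;
    }

    // Launch, e.g.: dim3 block(32, 4);
    //               dim3 grid((numCandX + 31) / 32, (numCandY + 3) / 4);
    //               detectKernel<<<grid, block>>>(numCandX, numCandY, passed);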

18 Challenge: Different Workloads
GPU execution refresher:
- Threads are grouped into threadblocks. Resources (thread IDs, registers, SMEM) are released only when all the threads in a block terminate
- Instructions are executed per warp (SIMT): 32 consecutive threads issue the same instruction. Different code paths are allowed; threads get masked out during the path they don't take
What this means for cascades:
- If at least one thread in a warp needs to evaluate a stage, all 32 threads go through the evaluation instructions and take the same time; the inactive threads waste math pipelines
- If at least one thread in a threadblock needs to continue evaluating, the resources of all the other threads in that block are not released, preventing new threads from starting right away

21 Stage Processing
- Threads decide whether to terminate after each stage
- Could process all stages with a single kernel launch, potentially wasting math and resources
- Could break the stages into segments (work compaction): a sequence of kernel launches, one per segment. Maintain a work queue and launch only as many threads as there are candidates in the queue; at the end of each segment, append the live candidates to the queue, using atomics to update the index (a sketch follows)
- Work-queue maintenance adds some overhead: read/write queues (writes are atomic), and the queue size must be communicated to the CPU for the subsequent launch
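A sketch of one such segment with a work queue; evalStage() and the int2 encoding of a candidate's (x, y) are assumptions for illustration:

    __device__ bool evalStage(int stage, int x, int y);  // one cascade stage; defined elsewhere

    // Process stages [stageBegin, stageEnd) for every queued candidate; survivors
    // are appended to the output queue through an atomic counter.
    __global__ void cascadeSegment(const int2* inQueue, int inCount,
                                   int2* outQueue, int* outCount,
                                   int stageBegin, int stageEnd)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= inCount)
            return;
        int2 cand = inQueue[idx];
        for (int s = stageBegin; s < stageEnd; ++s)
            if (!evalStage(s, cand.x, cand.y))
                return;                            // candidate rejected; thread exits
        int pos = atomicAdd(outCount, 1);          // reserve a slot in the output queue
        outQueue[pos] = cand;
    }

Between segments, the host reads back *outCount to size the next launch, swaps the in/out queues, and zeroes the counter; that read-back is the overhead mentioned above.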

22 Stage Processing: Timing Results
20-stage classifier, TK1:
- 1 segment: 127 ms (stages 1-20)
- 2 segments: 93 ms (stages 1-3, 4-20)
- 3 segments: 84 ms (stages 1-3, 4-7, 8-20)
16-stage classifier, TK1:
- 1 segment: 134 ms
- 2 segments: 126 ms (stages 1-2, 3-16)
16-stage classifier, K40: 9.8 ms and 8.7 ms for 1 and 2 segments

23 Why I Didn't Choose SMEM Here
SMEM could be used to store the integral-image tile needed by a threadblock, but:
- SMEM makes scaling features impractical: the SMEM overhead becomes prohibitive, forcing us to scale images instead
- SMEM precludes work compaction: a threadblock must cover a contiguous region to read all of its inputs
- A preliminary test with another classifier showed a very small difference between using SMEM and just reading via the texture cache (and the texture code was still scaling the image, which could have been avoided). Either texture functions or __ldg() with regular pointers can be used
Caution: the evidence isn't conclusive yet; classifiers that benefit little from compaction may benefit from SMEM.

26 Challenge: Enough Work to Saturate a GPU
- We start out with hundreds of thousands of candidates, plenty to saturate even the biggest GPUs
- We are left with fewer and fewer candidates as stages reject them: even 1-SM GPUs (TK1) will start idling, and bigger GPUs will start idling sooner
- Two solutions: process scales concurrently, or switch parallelization after some number of stages

29 Concurrent Scale Processing
- Issue kernels for different scales into different streams; scales are independent
- Maintain a separate work queue for each scale, so that features can be properly scaled
- Orthogonal to work compaction: loop through the segments, and for each segment launch as many kernels as you have scales (a sketch follows)
- GPU stream support in hardware: TK1 supports 4 streams; other GPUs (Kepler and more recent) support 32. More streams can be used in software, but they will result in stream aliasing
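A host-side sketch combining segments with concurrent scales; the queue bookkeeping is elided and all names are illustrative:

    const int NUM_SCALES = 5;                         // e.g. 5 scales, as in the experiments
    int2* inQueue[NUM_SCALES]; int2* outQueue[NUM_SCALES];  // device queues (allocated elsewhere)
    int*  outCount[NUM_SCALES];                       // device counters (allocated elsewhere)
    int   candidateCount[NUM_SCALES];                 // host copies of the queue sizes
    int   segFirstStage[] = {0, 3, 7};                // e.g. 3 segments: stages 1-3, 4-7, 8-20
    int   segLastStage[]  = {3, 7, 20};
    int   numSegments = 3;

    cudaStream_t streams[NUM_SCALES];
    for (int s = 0; s < NUM_SCALES; ++s)
        cudaStreamCreate(&streams[s]);

    // Outer loop over segments, inner loop over scales: each scale's kernel goes
    // into that scale's stream, so the small late-segment launches overlap.
    for (int seg = 0; seg < numSegments; ++seg) {
        for (int s = 0; s < NUM_SCALES; ++s) {
            int n = candidateCount[s];                // live candidates in this scale's queue
            if (n == 0) continue;
            int block = 256;
            int grid  = (n + block - 1) / block;
            cascadeSegment<<<grid, block, 0, streams[s]>>>(
                inQueue[s], n, outQueue[s], outCount[s],
                segFirstStage[seg], segLastStage[seg]);
        }
        cudaDeviceSynchronize();                      // then refresh per-scale counts, swap queues
    }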

30 TK1: 16-stage MBLBP Classifier
[Timeline figure: sequential vs. concurrent execution of scales 0-4, and concurrent execution with 2 segments]

31 K40: 16-stage MBLBP Classifier
[Timeline figure: 9.7 ms, 8.3 ms, 6.1 ms (2 segments)]

32 TK1: 78.9 ms, 20-stage Haar-like classifier, 5 scales
[Timeline figure: per-scale kernels for stage segments 1-3, 4-7, and 8-20]
K40: 7.2 ms

33 Switching Parallelizations
One thread per candidate:
- Pro: candidates go through the minimal stage count
- Con: the GPU becomes latency limited. After a number of stages there isn't enough work to hide latency (very rough rule of thumb: fewer than 512 threads per SM) and the GPU becomes underutilized
Alternative parallelization: multiple threads per candidate, say a warp:
- The warp evaluates 32 features in parallel, then performs a reduction (or prefix sum) to compute the stage sum. A power-of-2 group up to a warp is nice because of the shfl/vote instructions (a sketch follows)
- May do unnecessary work: a thread evaluates a feature it wouldn't have reached sequentially
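A sketch of warp-per-candidate stage evaluation using warp shuffles. Note the talk predates the _sync shuffle variants; this uses the modern __shfl_down_sync/__shfl_sync intrinsics, and evalFeature() is assumed to exist as a device function:

    __device__ float evalFeature(int featureIdx, int x, int y);  // defined elsewhere

    // All 32 lanes of a warp cooperate on one candidate: each lane evaluates every
    // 32nd feature of the stage, then the partial sums are reduced across the warp.
    __device__ bool evalStageWarp(int firstFeature, int numFeatures,
                                  int x, int y, float threshold)
    {
        int lane = threadIdx.x & 31;
        float sum = 0.0f;
        for (int f = lane; f < numFeatures; f += 32)
            sum += evalFeature(firstFeature + f, x, y);
        for (int offset = 16; offset > 0; offset >>= 1)  // warp-wide tree reduction
            sum += __shfl_down_sync(0xffffffffu, sum, offset);
        sum = __shfl_sync(0xffffffffu, sum, 0);          // broadcast lane 0's total
        return sum >= threshold;                         // same decision on every lane
    }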

34 Switching Parallelizations
- Idea: change parallelization when only a few hundred candidates remain; prior to that, continue to use 1 thread/candidate, which avoids inter-thread communication and unnecessary work
- Preliminary work on a different classifier (a few hundred features): speedup of up to ~1.5x on K40 (depending on image), 1.0x on TK1
- The results suggest that the alternative parallelization helps when you have lots of stages with too few candidates to saturate the GPU; this was confirmed when TK1 ran a classifier with even more stages

35 Work Reduction

36 Reduce the Initial Number of Candidates
- Less work -> less time, though we will reach the point of a non-saturated GPU sooner, which makes concurrent scale processing even more useful
- Two ways to reduce the initial candidate count: use a mask to not consider some candidates (ROI, skin tone, etc.), or skip candidates (stride > 1) and post-process the neighborhoods of rectangles that didn't get grouped

37 Skin Tone Mask
- Race invariant; simply needs a white-balanced camera
- [Figure: color density plots for Asian, African, and Caucasian skin]

38 Candidate Mask
- A mask pixel at (x,y) corresponds to the upper-left corner of a candidate window; shown for scale-0 (58-pixel face)
- A candidate window is masked out (black) if fewer than 50% of its pixels are skin-toned
- 76% of candidates were rejected at this scale

39 More Candidate Masks

40 Skin-tone Masking
A bit of extra work:
- Classify each input pixel as skin-toned or not: 5-10 math instructions in RGB or YUV, which can be done in the same kernel as the RGB->luminance conversion
- Compute an integral image of the pixel classes
- Use that integral image to reject candidates when creating the initial work queue for detection (a sketch follows)
Experimental data, TK1:
- No mask, no streams, no segments: 134 ms
- Mask, no streams, no segments: 34.9 ms (~4x speedup, as expected)
- Mask, streams, no segments: 34.4 ms
- Mask, streams, segments: 34.0 ms
K40:
- No mask, no streams, no segments: 9.8 ms
- Mask, no streams, no segments: 4.3 ms
- Mask, streams, no segments: 2.8 ms
- Mask, streams, segments: 2.1 ms (~2.3x speedup rather than the expected ~4x indicates the GPU is idling)
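A sketch of the rejection test, assuming an integral image of per-pixel skin labels (0 or 1) with a zero first row and column; the 50% threshold matches the masks shown earlier:

    // Sum over a w x h rectangle with top-left (x, y); ii has 'pitch' elements
    // per row and a zero first row/column, so four loads suffice.
    __device__ int rectSum(const int* ii, int pitch, int x, int y, int w, int h)
    {
        return ii[(y + h) * pitch + (x + w)] - ii[y * pitch + (x + w)]
             - ii[(y + h) * pitch + x]       + ii[y * pitch + x];
    }

    // Keep the candidate whose window starts at (x, y) only if at least 50%
    // of its win x win pixels were classified as skin-toned.
    __device__ bool skinMaskAccepts(const int* skinII, int pitch, int x, int y, int win)
    {
        return 2 * rectSum(skinII, pitch, x, y, win, win) >= win * win;
    }

When building the initial work queue, a thread appends its candidate only if skinMaskAccepts() is true, using the same atomic append as in the segment kernel.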

41 Improving Cache Behavior

42 Improving Cache Behavior
- Until now the integral image was 1921x1081
- The first scale (scale-0) is 2.44x: the training window is 24 pixels, the smallest face of interest is 50 pixels, and the scaling factor is 1.25x. This implies that a 787x443 integral image is sufficient, ~6x smaller than the original
- A smaller image footprint can improve cache behavior. In this case it's the read-only (texture) cache on the SM: fewer requests to L2 means lower latency and less bandwidth pressure, and a higher L2 hit rate means less traffic to DRAM

43 Empirical Data
16-stage MBLBP classifier, 2 segments, concurrent scale processing:
- TK1: mask: 2.12x speedup (34 ms -> 16 ms); no mask: 2.33x speedup (126 ms -> 54 ms)
- K40: mask: 1.27x speedup (2.1 ms -> 1.7 ms); no mask: 1.56x speedup (7.5 ms -> 4.8 ms)

44 Benefits of Downscaling
- Reduced requests to L2 by ~3x on both GPUs
- TK1 was being limited by L2 bandwidth: 40-93% of L2 theoretical throughput before downscaling, 28-74% after
- K40 was sensitive to L2 bandwidth too, but less so: 12-70% of theory before, 5-35% after. K40 has 1.6x more L2 bandwidth per SM than TK1, and is thus less bandwidth sensitive for this application
- Improved L2 hit rate (lowered traffic to DRAM): TK1 from 5-55% to 54-98%, K40 from 44-99% to 98-99%. K40 has 12x more L2 than TK1 and is thus able to achieve a higher hit rate, reducing traffic to DRAM

46 Quick Summary
We've examined several ways to improve performance:
- Breaking stages into segments: up to 1.3x
- Concurrent processing of scales: 1.2-2x (can be higher, depending on classifier and GPU)
- Downscaling the image to the first scale first: 1.3-2.3x
- Masking (ROI): ~3x (depends on content and masking approach)
All of the above use the exact same kernel code; only the image or the launch parameters are adjusted.
Together they improved cascade time: TK1 from 134 ms to 16 ms (8.4x speedup), K40 from 9.8 ms to 1.7 ms (5.8x speedup).
Switching parallelization after a number of stages offers a potential further speedup of ~1.5x.

47 Note on Feature Format

48 Feature Storage Format
Many features are rectangle based. Two approaches to storing features in memory:
- Geometry: coordinates/sizes within a window
- Pointers (popular in OpenCV and other codes): compute pointers to the vertices of window 0, the first window (top-left corner, for example). Vertices for window k are addressed by adding offset k to these pointers. Pointer math per vertex is a 64-bit multiply-add, a dependent sequence of 2+ instructions on the GPU

49 MB-LBP Features
Only one pattern: a 3x3 tile of rectangles.
- Pointers: need 16 pointers: 128 B per feature, 32 or more address instructions per window
- Geometry: 4 values, the (x,y) of the top-left corner plus width and height: 16 bytes per feature when storing ints (could be as low as 4 B when storing chars, but that would require bit-extraction instructions). Address math: ~50 instructions. A sketch follows
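A sketch of evaluating a geometry-stored MB-LBP feature using the 16-byte int layout above; the neighbor ordering is one arbitrary convention, and rectSum() is the integral-image helper from the masking sketch:

    __device__ int rectSum(const int* ii, int pitch, int x, int y, int w, int h);  // as before

    struct MbLbpFeature { int x, y, w, h; };  // top-left cell of the 3x3 tile, cell size

    // Compute the 8-bit MB-LBP code: compare each of the 8 outer cells of the
    // 3x3 tile against the center cell, in a fixed clockwise order.
    __device__ unsigned int mblbpCode(const int* ii, int pitch,
                                      const MbLbpFeature& f, int winX, int winY)
    {
        int x = winX + f.x, y = winY + f.y;
        int center = rectSum(ii, pitch, x + f.w, y + f.h, f.w, f.h);
        const int dx[8] = {0, 1, 2, 2, 2, 1, 0, 0};
        const int dy[8] = {0, 0, 0, 1, 2, 2, 2, 1};
        unsigned int code = 0;
        for (int i = 0; i < 8; ++i) {
            int cell = rectSum(ii, pitch, x + dx[i] * f.w, y + dy[i] * f.h, f.w, f.h);
            code = (code << 1) | (cell >= center ? 1u : 0u);
        }
        return code;  // the weak classifier then looks this code up in its table
    }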

50 Haar-like Features
5 fundamental patterns (2 or 3 rectangles).
- Pointers: 6, 8, or 9 pointers: 48-72 bytes per feature, 12 or more address instructions per window
- Geometry: several choices: store each rectangle (2 or 3 per feature), or store vertices (which would need 5 categories). When storing each rectangle: 4 values, the (x,y) of the top-left corner plus width and height, all relative to the training window (usually 20x20 to 32x32 in size). So a rectangle could be stored in as few as 4 B (4 chars), or 16 B if storing ints; 4 chars would require bit-extraction instructions. 3x16 B = 48 B per feature, ~3x16 = 48 instructions per window. A sketch follows
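A corresponding sketch for a geometry-stored Haar-like feature, scaled on the fly as geometry storage allows. The explicit per-rectangle weight here is an illustrative addition (the talk's 48 B figure stores only the 4 ints per rectangle), and the per-window variance normalization that real cascades apply is omitted:

    __device__ int rectSum(const int* ii, int pitch, int x, int y, int w, int h);  // as before

    struct HaarRect    { int x, y, w, h; float weight; };  // relative to training window
    struct HaarFeature { HaarRect r[3]; int numRects; };   // 2 or 3 weighted rectangles

    // Weighted sum of the feature's rectangles; the geometry is scaled per candidate
    // scale with plain 32-bit math, which is what makes geometry storage cheap.
    __device__ float evalHaarFeature(const int* ii, int pitch, const HaarFeature& f,
                                     int winX, int winY, float scale)
    {
        float v = 0.0f;
        for (int i = 0; i < f.numRects; ++i) {
            const HaarRect& r = f.r[i];
            v += r.weight * rectSum(ii, pitch,
                                    winX + (int)(r.x * scale), winY + (int)(r.y * scale),
                                    (int)(r.w * scale), (int)(r.h * scale));
        }
        return v;  // compared against the weak classifier's threshold
    }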

51 Pointers vs Geometry
When processing multiple scales:
- Geometry places no additional requirements
- Pointers require one of: computing the pointers for each scale, or scaling the image and computing an integral image for each scale
Pointers also require one of:
- An additional buffer for the integral image, reused by all input images (so the pointers stay valid), or
- Recomputing the pointers for each input image

52 MBLBP Performance
- Geometry was 3.5x faster than pointers (quick test: no segments, no streams, no mask)
- All other numbers in this presentation were measured with geometry

53 Feature Multiples
Sometimes the same feature is used in several stages. Two choices:
- Keep multiple copies of the feature in memory: simple array traversal, consumes more memory
- Add a level of indirection: each feature is stored exactly once, and an array of indices maps weak classifiers to unique features (the approach implemented in OpenCV)
Preference for performance: store multiple copies and avoid indirection. Indirection adds hundreds to thousands of cycles of latency and adds bandwidth pressure as well (read the index from memory, then use the index to read the feature from memory). Typically only a very small percentage of features are replicated, so the impact on memory consumed is negligible.

54 Summary
Cascade performance for a 16-stage MBLBP classifier: TK1: 16.0 ms, K40: 1.6 ms. These can likely be improved further (the numbers are without switched parallelization).
We looked at:
- How to parallelize cascaded classifiers
- How to reduce the input to a cascade
- How to maximize cache performance for cascades
- How to store features in memory
The performance impact of the above varies with classifier, detection parameters, and GPU; good choices can lead to an O(10) speedup over the naive approach.

More information