A GPU Implementation of Tiled Belief Propagation on Markov Random Fields. Hassan Eslami, Theodoros Kasampalis, Maria Kotsifakou



BP-M AND TILED-BP

BP-M

Tiled BP (figure: the image partitioned into a grid of tiles T0 through T8)

Tiled BP: for each tile
- Reading boundary messages from memory
- Local computation (local BP-M) on local data
- Writing the resulting boundary messages to memory

Tiled BP

BACKGROUND ON GPU

GPU Programming Model
The host launches a kernel on the device: kernel<<<GridDim, BlockDim>>>(args)
- The number of threads and thread blocks is specified at kernel launch
- All threads execute the same kernel function
(Figure: the grid is a 2D array of thread blocks; each block is itself an array of threads.)
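As a minimal illustration of the launch syntax above (a hypothetical `scale` kernel, not the paper's code):

```cuda
#include <cuda_runtime.h>

// Every thread runs the same kernel function; the built-in indices tell
// each thread which element it owns.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    if (i < n)                                      // guard the tail block
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // Grid and block dimensions are chosen at launch time.
    dim3 blockDim(256);
    dim3 gridDim((n + blockDim.x - 1) / blockDim.x);
    scale<<<gridDim, blockDim>>>(d, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```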

GPU Memory Model
- Global memory: accessible by all threads
- Shared memory: scratchpad memory, shared by the threads within a thread block
- Other components of the memory hierarchy are not shown (registers, constant memory, caches)
Threads within a thread block are cooperative and can synchronize with __syncthreads().
(Figure: each block has its own shared memory; all blocks access global memory.)
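A minimal sketch of this pattern (a hypothetical block-wise reduction, not the paper's kernel): each block stages data in its shared-memory scratchpad and synchronizes before using it.

```cuda
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float tile[256];          // per-block scratchpad memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];           // each thread loads one element
    __syncthreads();                     // all loads visible to the block

    // Tree reduction inside the block, synchronizing between steps.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];       // one result per block to global memory
}
```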

Kernel Execution
Thread blocks are scheduled onto Streaming Multiprocessors (SMs):
- SMs contain simple processors with deep pipelines (throughput-oriented architecture)
- An SM can accommodate multiple thread blocks simultaneously; the exact number depends on hardware restrictions
- A thread block resides in an SM until its execution is completed
(Figure: the kernel's blocks distributed across SM 0 through SM n, all backed by device memory.)

OUR METHOD AND EVALUATION

Tiled BP on GPU: a thread block for each tile

Tiled BP on GPU, the big picture (figure: wavefront steps separated by barrier synchronization)

Tiled BP on GPU, Finer Granularity: looking into a tile
- One thread per message vector element
- Different groups of threads in a thread block
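The thread-to-element mapping can be sketched as follows (hypothetical layout; TILE_W and LABELS stand for the tile width and the number of disparity labels, and the update body is elided):

```cuda
#define TILE_W 13   // assumed tile width in pixels
#define LABELS 16   // assumed number of labels per message vector

// Launched with blockDim = (TILE_W, LABELS): thread (x, y) owns label y of
// pixel x in the row currently being swept, so one whole row of message
// vectors is updated in parallel.
__global__ void rowSweep(float *messages)
{
    int pixel = threadIdx.x;             // which pixel in the row
    int label = threadIdx.y;             // which element of its vector
    int idx   = pixel * LABELS + label;  // row-wise flat index

    // The per-element BP-M message update for (pixel, label) would go
    // here; different thread groups handle the four message directions.
    (void)messages; (void)idx;
}
```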

Tiled BP on GPU, Finer Granularity: looking into a tile. The same holds for the Up and Down directions.

Optimization 1: Shared Memory
Load data into shared memory at the start of local BP-M for each tile:
- Boundary messages
- Data vectors of pixels
Reserve space in shared memory for the other data used in the computation:
- Internal message vectors
- Outgoing boundary messages

Optimization 1: Shared Memory
- All data loading is coalesced
- Vertical and horizontal boundary messages are stored row-wise for memory coalescing
- Tiles are at most 13 by 13 so that all data fits in the 48 KB of shared memory
- At most 13x16 = 208 threads in a thread block
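A sketch of the staging step (hypothetical names and sizes): consecutive threads read consecutive global addresses, which is what makes the loads coalesced, and storing boundary messages row-wise keeps this true for both orientations.

```cuda
#define TILE_W 13   // assumed tile width
#define LABELS 16   // assumed labels per message vector

__global__ void processTile(const float *g_boundary, float *g_out)
{
    // Boundary messages for one tile edge: TILE_W vectors of LABELS
    // floats, stored row-wise so the flat index below walks consecutive
    // global addresses (coalesced access).
    __shared__ float s_boundary[TILE_W * LABELS];

    int tid = threadIdx.y * blockDim.x + threadIdx.x;  // flat thread id
    if (tid < TILE_W * LABELS)
        s_boundary[tid] = g_boundary[blockIdx.x * TILE_W * LABELS + tid];
    __syncthreads();   // staged data now visible to the whole block

    // ... local BP-M iterations on the shared-memory copy run here ...

    // Write the resulting boundary messages back, again coalesced.
    if (tid < TILE_W * LABELS)
        g_out[blockIdx.x * TILE_W * LABELS + tid] = s_boundary[tid];
}
```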

Optimization 1: Shared Memory
- With the maximum tile size, a single thread block uses all of the shared memory
- Given that, each SM can accommodate just one thread block
- This underutilizes the SMs, but it is exactly the configuration required for inter-block barrier synchronization

Optimization 2: Fast Global Barrier
State-of-the-art GPU global barrier [1]: no need to launch multiple kernels, significantly reducing kernel-launch overheads.
Requirements:
- One thread block per SM
- Number of thread blocks at kernel launch equal to the number of SMs
- Manual scheduling of thread blocks on tiles
[1] S. Xiao and W.-c. Feng, "Inter-block GPU communication via fast barrier synchronization," in IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010
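A simplified sketch in the spirit of the cited barrier (the names are ours, not the paper's): it is correct only when there is exactly one resident block per SM, so that every block is guaranteed to be running, and a production version also needs memory fences.

```cuda
// `arrived` is a global array of one flag per block; `goal` must be a
// fresh, strictly increasing value on each call (e.g. 2 * iteration).
__device__ void globalBarrier(volatile int *arrived, int numBlocks, int goal)
{
    __syncthreads();                        // everyone in this block is done
    if (threadIdx.x == 0)
        arrived[blockIdx.x] = goal;         // announce this block's arrival

    if (blockIdx.x == 0) {
        // Block 0's threads spin until every block has announced itself.
        for (int i = threadIdx.x; i < numBlocks; i += blockDim.x)
            while (arrived[i] != goal) { }
        __syncthreads();
        // Release phase: reuse the flags to signal "go".
        for (int i = threadIdx.x; i < numBlocks; i += blockDim.x)
            arrived[i] = goal + 1;
    }

    if (threadIdx.x == 0)
        while (arrived[blockIdx.x] != goal + 1) { }  // wait for release
    __syncthreads();                        // whole block proceeds together
}
```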

Other Optimizations
- Fast and parallelized message calculation [1]
- Manual analysis and tuning of the code
- Removing some of the __syncthreads() instructions
[1] C.-C. Cheng, C.-K. Liang, Y.-C. Lai, H. H. Chen, and L.-G. Chen, "Fast belief propagation process element for high-quality stereo estimation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009), IEEE, 2009, pp. 745-748

Evaluation

Algorithm   | Hardware                     | Price (USD) | Tsukuba: Exec. Time (ms) / Accuracy [4] | Judging test: Exec. Time (ms) / Accuracy [4] | Exec. Time (ms) / Accuracy [4]
BP-M [1]    | CPU                          | $300        | 39,802 / 79.8                           | 39,767 / 86.5                                |
TiledBP [2] | CPU                          | $300        | 1,585.85 / 82.1                         | 1,586.75 / 80.9                              |
TiledBP     | GPU: NVIDIA GTX 680 [3]      | $500        | 9.29 / 82.1                             | 9.24 / 80.9                                  | 115.0 / 83.8
TiledBP     | GPU: NVIDIA Tesla C2050 [3]  | $1350       | 7.96 / 82.1                             | 7.95 / 80.9                                  | 90.7 / 83.8

[1] Given reference code on an Intel Xeon E5-1620 @ 3.60 GHz
[2] TiledBP CPU implementation on an Intel Xeon E5-1620 @ 3.60 GHz
[3] GTX 680 with 8 SMs; Tesla C2050 with 14 SMs
[4] Percentage of accurate depth labels compared to ground truth

Conclusion
New GPU implementation of tiled BP for stereo matching:
- Wavefront computation
- Inter-block GPU barrier synchronization
Evaluation:
- Comparable accuracy
- Comparable price
- 200X speedup compared to CPU tiled BP

BACK-UP SLIDES

Optimization 2: Fast Global Barrier
- One thread block per SM
- Number of thread blocks at kernel launch equal to the number of SMs
- Manual scheduling of thread blocks on tiles
Code snippet from: S. Xiao and W.-c. Feng, "Inter-block GPU communication via fast barrier synchronization," in IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010