A GPU Implementation of Tiled Belief Propagation on Markov Random Fields

Hassan Eslami, Theodoros Kasampalis, Maria Kotsifakou



2 BP-M AND TILED-BP

3 BP-M

4 Tiled BP
[Figure: the image divided into tiles T0 through T8]

5 Tiled BP
For each tile:
  Reading boundary messages from memory
  Local computation on local data (local BP-M)
  Writing the resulting boundary messages to memory

6 Tiled BP

7 BACKGROUND ON GPU

8 GPU Programming Model
[Figure: the host launches a kernel with <<<GridDim, BlockDim>>>(args); on the device, the grid consists of Block (0,0), Block (0,1), Block (1,0), Block (1,1), ..., and each block contains threads indexed from (0,0)]
The number of threads and thread blocks is specified at kernel launch.
All threads execute the same kernel function.
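To make the grid and block indexing concrete, here is a minimal host-side sketch in Python (not GPU code). It enumerates the indices a 2D launch would produce, using the usual CUDA convention that a global index is blockIdx * blockDim + threadIdx per axis. The function name `global_thread_ids` is a hypothetical helper introduced only for illustration.

```python
from itertools import product


def global_thread_ids(grid_dim, block_dim):
    """Enumerate (block_idx, thread_idx, global_idx) for a 2D launch.

    grid_dim and block_dim are (x, y) tuples, playing the role of
    GridDim and BlockDim in <<<GridDim, BlockDim>>>.
    """
    grid_x, grid_y = grid_dim
    block_x_size, block_y_size = block_dim
    ids = []
    for block_y, block_x in product(range(grid_y), range(grid_x)):
        for thread_y, thread_x in product(range(block_y_size), range(block_x_size)):
            # Same arithmetic every thread performs inside a kernel.
            global_x = block_x * block_x_size + thread_x
            global_y = block_y * block_y_size + thread_y
            ids.append(((block_x, block_y), (thread_x, thread_y), (global_x, global_y)))
    return ids
```

Every simulated thread runs the same index arithmetic and differs only in its block and thread coordinates, which is the sense in which all threads execute the same kernel function.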

9 GPU Memory Model
[Figure: a grid of thread blocks; each block has its own shared memory and threads (0,0), (1,0), (0,1), (1,1), ...; all blocks access global memory]
Global memory: accessible by all threads.
Shared memory: scratchpad memory, shared by the threads within a thread block.
Threads within a thread block are cooperative and can synchronize with __syncthreads().
Other components of the memory hierarchy (registers, constant memory, caches) are not shown.

10 Kernel Execution
[Figure: the thread blocks of a kernel are scheduled onto the GPU's Streaming Multiprocessors, SM 0 through SM n, which share device memory]
Thread blocks are assigned to Streaming Multiprocessors (SMs).
SMs contain simple processors with deep pipelines (throughput-oriented architecture).
An SM can accommodate multiple thread blocks simultaneously; the exact number depends on hardware restrictions.
A thread block resides in an SM until its execution is completed.

11 OUR METHOD AND EVALUATION

12 Tiled BP on GPU
A thread block for each tile

13 Tiled BP on GPU: A Big Picture
Barrier synchronization
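The slides do not spell out which tiles run between two barriers. The conclusion mentions wavefront computation, and one common reading of that term, assumed here and not confirmed by the source, is that tiles on the same anti-diagonal of the tile grid are independent and can be processed together, with a barrier between consecutive diagonals. A minimal Python sketch of that grouping (the name `wavefronts` is hypothetical):

```python
def wavefronts(tile_rows, tile_cols):
    """Group tile coordinates (row, col) by anti-diagonal, i.e. by row + col.

    Assumption: tiles in one group do not depend on each other, so each
    group could be processed in parallel between two barriers.
    """
    fronts = [[] for _ in range(tile_rows + tile_cols - 1)]
    for row in range(tile_rows):
        for col in range(tile_cols):
            fronts[row + col].append((row, col))
    return fronts
```

For a 3 by 3 tile grid this yields five groups of sizes 1, 2, 3, 2, 1, so the available parallelism grows and then shrinks as the front sweeps across the image.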

14 Tiled BP on GPU: Finer Granularity
Looking into a tile
One thread per message vector element
Different groups of threads in a thread block

15 Tiled BP on GPU: Finer Granularity
Looking into a tile
The same for Up and Down
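To illustrate what "one thread per message vector element" could mean, here is a minimal Python sketch of a min-sum message update. The slides do not give the update rule, so the standard min-sum form is assumed; `message_element`, `message_vector`, and the truncated-linear smoothness cost with its `weight` and `cap` parameters are illustrative choices, not taken from the source.

```python
def truncated_linear(label_a, label_b, weight=1.0, cap=2.0):
    """Assumed smoothness cost V(a, b) = weight * min(|a - b|, cap)."""
    return weight * min(abs(label_a - label_b), cap)


def message_element(label, data_cost, incoming, smoothness=truncated_linear):
    """One element of an outgoing message: the work of a single 'thread'.

    data_cost: per-label data costs of the sending pixel.
    incoming:  message vectors arriving at the sender from its other neighbours.
    """
    return min(
        data_cost[other] + sum(msg[other] for msg in incoming) + smoothness(label, other)
        for other in range(len(data_cost))
    )


def message_vector(data_cost, incoming, smoothness=truncated_linear):
    """Full outgoing message; on a GPU each element would map to one thread."""
    return [
        message_element(label, data_cost, incoming, smoothness)
        for label in range(len(data_cost))
    ]
```

Each element depends only on the sender's data costs and incoming messages, so the elements of one message vector can be computed independently. The same routine would serve all four directions by changing which neighbours count as incoming.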

16 Optimization 1: Shared Memory
Load data to shared memory at the start of local BP-M for each tile:
  Boundary messages
  Data vectors of pixels
Reserve space in shared memory for the other data used in the computation:
  Internal message vectors
  Outgoing boundary messages

17 Optimization 1: Shared Memory
All data loading is coalesced.
Vertical and horizontal boundary messages are stored row-wise for memory coalescing.
Tiles are at most 13 by 13 so that all data fits in the 48 KB of shared memory.
At most 13x16 = 208 threads in a thread block.

18 Optimization 1: Shared Memory
With the maximum tile size, one thread block uses all of the shared memory.
As a result, each SM can accommodate just one thread block.
This underutilizes the SMs but suits inter-block barrier synchronization.
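The link between tile size, shared-memory footprint, and the number of blocks an SM can hold can be sketched with simple arithmetic. The slides do not give the exact storage layout, so the per-pixel and per-boundary vector counts below are hypothetical parameters, and the sketch is not expected to reproduce the 13 by 13 limit unless those parameters match the real layout. All function names are introduced only for illustration.

```python
def shared_bytes_per_block(tile_side, num_labels, vectors_per_pixel,
                           boundary_vectors_per_edge_pixel, bytes_per_value=4):
    """Shared memory needed by one tile under an assumed layout.

    vectors_per_pixel: label-sized vectors stored per interior pixel
        (data vector plus internal messages); hypothetical.
    boundary_vectors_per_edge_pixel: label-sized vectors stored per pixel
        on each of the four tile edges; hypothetical.
    """
    interior = tile_side * tile_side * vectors_per_pixel
    boundary = 4 * tile_side * boundary_vectors_per_edge_pixel
    return (interior + boundary) * num_labels * bytes_per_value


def max_tile_side(budget_bytes, num_labels, vectors_per_pixel,
                  boundary_vectors_per_edge_pixel, bytes_per_value=4):
    """Largest square tile whose footprint fits in budget_bytes."""
    side = 0
    while shared_bytes_per_block(side + 1, num_labels, vectors_per_pixel,
                                 boundary_vectors_per_edge_pixel,
                                 bytes_per_value) <= budget_bytes:
        side += 1
    return side


def resident_blocks_per_sm(shared_per_sm, shared_per_block, hardware_block_limit):
    """Blocks that fit on one SM when shared memory is the limiting resource."""
    if shared_per_block == 0:
        return hardware_block_limit
    return min(hardware_block_limit, shared_per_sm // shared_per_block)
```

Under this model, as soon as one block needs more than half of an SM's shared memory, `resident_blocks_per_sm` drops to 1, which matches the slide's observation that the maximum tile size leaves room for a single thread block per SM.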


19 Optimization 2: Fast Global Barrier
State-of-the-art GPU global barrier [1]
No need to launch multiple kernels, significantly reducing overheads
Requirements:
  One thread block per SM
  Number of thread blocks at kernel launch equal to number of SMs
  Manual scheduling of thread blocks on tiles
[1] S. Xiao and W.-c. Feng, Inter-block GPU communication via fast barrier synchronization, in IEEE International Symposium on Parallel & Distributed Processing (IPDPS),
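The combination of a fixed number of thread blocks, manual tile scheduling, and a global barrier can be imitated on the CPU. The sketch below is a Python analogy only: OS threads stand in for thread blocks, and `threading.Barrier` stands in for the inter-block barrier, which on a GPU lives in device memory and is not reproduced here. The names `run_persistent_blocks`, `phases`, and `process_tile` are hypothetical, and the strided assignment of tiles to blocks is an assumed scheduling policy.

```python
import threading


def run_persistent_blocks(num_blocks, phases, process_tile):
    """Run every phase with a fixed set of workers and a barrier between phases.

    num_blocks:   number of workers, playing the role of one block per SM.
    phases:       list of lists of tile ids; tiles within a phase are independent.
    process_tile: callable(block_id, tile) doing the per-tile work.
    """
    barrier = threading.Barrier(num_blocks)

    def block_body(block_id):
        for phase in phases:
            # Manual scheduling: block b takes tiles b, b + num_blocks, ...
            for tile in phase[block_id::num_blocks]:
                process_tile(block_id, tile)
            # Global barrier: no block starts the next phase early.
            barrier.wait()

    workers = [threading.Thread(target=block_body, args=(block_id,))
               for block_id in range(num_blocks)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

Because the workers stay alive across phases, the whole computation corresponds to a single kernel launch. The alternative of launching one kernel per phase gets its synchronization from the launch boundary, at the cost of the launch overhead the slide refers to.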

20 Other Optimizations
Fast and parallelized message calculation [1]
Manual analysis and tuning of the code
Removing some of the __syncthreads() instructions
[1] C.-C. Cheng, C.-K. Liang, Y.-C. Lai, H. H. Chen, and L.-G. Chen, Fast belief propagation process element for high-quality stereo estimation, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009), IEEE, 2009, pp

21 Evaluation
Columns: Algorithm, Hardware, Price (USD), Tsukuba Exec. Time (ms), Tsukuba Accuracy [4], Judging Test Exec. Time (ms), Judging Test Accuracy [4]
BP-M [1], CPU, $300, 39, ,
TiledBP [2], CPU, $300, 1, ,
TiledBP, GPU NVIDIA GTX 680 [3], $
TiledBP, GPU NVIDIA Tesla C2050 [3], $
[1] Given reference code on Intel Xeon 3.60GHz
[2] TiledBP CPU implementation on Intel Xeon 3.60GHz
[3] GTX 680 with 8 SMs and Tesla C2050 with 14 SMs
[4] Percentage of accurate depth labels compared to ground truth

22 Conclusion
New GPU implementation of TiledBP for stereo matching
  Wavefront computation
  Inter-block GPU barrier synchronization
Evaluation
  Comparable accuracy
  Comparable price
  200X speedup compared to CPU TiledBP

23 BACK-UP SLIDES

24 Optimization 2: Fast Global Barrier
One thread block per SM
Number of thread blocks at kernel launch equal to number of SMs
Manual scheduling of thread blocks on tiles
Code snippet from: S. Xiao and W.-c. Feng, Inter-block GPU communication via fast barrier synchronization, in IEEE International Symposium on Parallel & Distributed Processing (IPDPS),
