A GPU Implementation of Tiled Belief Propagation on Markov Random Fields
Hassan Eslami, Theodoros Kasampalis, Maria Kotsifakou
BP-M AND TILED-BP
BP-M
Tiled BP
[Figure: the pixel grid partitioned into tiles T0 through T8]
Tiled BP
- Reading boundary messages from memory
- Local computation on local data (local BP-M)
- Writing the resulting boundary messages to memory
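As a minimal sketch of the three per-tile steps above, the host-side loop might look as follows; the Tile type and the helper functions are hypothetical placeholders, not the authors' code.

```cuda
// Hypothetical sketch of one tiled-BP pass; names are illustrative only.
#include <vector>

struct Tile { int row, col; };                       // one tile of the pixel grid

// Stubs standing in for the three per-tile steps listed on the slide.
void read_boundary_messages(const Tile&)  { /* load messages on the tile border */ }
void run_local_bpm(const Tile&)           { /* BP-M iterations restricted to the tile */ }
void write_boundary_messages(const Tile&) { /* publish the updated border messages */ }

void tiled_bp_pass(const std::vector<Tile>& tiles) {
    for (const Tile& t : tiles) {                    // independent tiles can run in parallel
        read_boundary_messages(t);                   // step 1: read neighbours' boundary messages
        run_local_bpm(t);                            // step 2: local computation on local data
        write_boundary_messages(t);                  // step 3: write the resulting boundary messages
    }
}

int main() {
    std::vector<Tile> tiles = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
    tiled_bp_pass(tiles);
    return 0;
}
```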
Tiled BP
BACKGROUND ON GPU
GPU Programming Model
[Figure: the host launches a kernel with <<<GridDim,BlockDim>>>(args); the device runs a grid of thread blocks, each containing many threads]
- Number of threads and thread blocks is specified at kernel launch
- All threads execute the same kernel function
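For readers new to CUDA, a minimal, illustrative example of the launch syntax shown above (the scale kernel and the sizes are made up for this sketch):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread executes this same function; built-in indices tell each thread who it is.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // global thread index
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    dim3 blockDim(256);                               // threads per block
    dim3 gridDim((n + blockDim.x - 1) / blockDim.x);  // thread blocks in the grid
    scale<<<gridDim, blockDim>>>(d, 2.0f, n);         // grid and block sizes are fixed at launch

    cudaDeviceSynchronize();
    cudaFree(d);
    printf("done\n");
    return 0;
}
```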
GPU Memory Model
[Figure: a grid of thread blocks, each with its own shared memory, all connected to global memory]
- Global memory: accessible by all threads
- Shared memory: scratchpad memory, shared by threads within a thread block
- Other components of the memory hierarchy are not shown (registers, constant memory, caches)
- Threads within a thread block are cooperative and can synchronize with __syncthreads()
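A small, illustrative kernel showing the shared-memory plus __syncthreads() pattern; it assumes it is launched with 256-thread blocks and is not taken from the authors' code:

```cuda
#include <cuda_runtime.h>

// Each block stages a chunk of global memory in its shared memory, synchronizes,
// then reads values written by other threads of the same block (a block-local reversal).
__global__ void reverse_in_block(const float* in, float* out, int n) {
    __shared__ float buf[256];                        // scratchpad, one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[threadIdx.x] = in[i];              // cooperative load from global memory
    __syncthreads();                                  // all threads of the block wait here

    int j = blockDim.x - 1 - threadIdx.x;             // element loaded by another thread
    int base = blockIdx.x * blockDim.x;
    if (i < n && base + j < n) out[i] = buf[j];       // safe: buf[j] was filled by thread j
}
```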
Kernel Execution
[Figure: the thread blocks of a kernel are scheduled onto Streaming Multiprocessors SM 0 ... SM n, which share the device memory]
- Thread blocks are assigned to SMs
- SMs contain simple processors with deep pipelines (throughput-oriented architecture)
- An SM can accommodate multiple thread blocks simultaneously; the exact number depends on hardware restrictions
- A thread block resides in an SM until its execution is completed
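The per-device limits that govern this scheduling can be queried at run time; a short example using the standard CUDA device-property API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The SM count and the per-SM/per-block limits determine how many blocks can be resident at once.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs:                     %d\n", prop.multiProcessorCount);
    printf("max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```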
OUR METHOD AND EVALUATION
Tiled BP on GPU
- A thread block for each tile
Tiled BP on GPU: A Big Picture
[Figure: overall execution of the tiles, with barrier synchronization between steps]
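A hypothetical skeleton of this big picture, under the assumptions described on the later slides (persistent thread blocks, one per SM): blocks sweep over the tiles step by step, with a device-wide barrier in between. The barrier body is left empty here; a lock-free sketch appears with the back-up slides.

```cuda
// Placeholder barrier; see the inter-block barrier sketch in the back-up slides.
__device__ void wavefront_barrier(volatile int* arrive, volatile int* depart,
                                  int num_blocks, int phase) {
    (void)arrive; (void)depart; (void)num_blocks; (void)phase;
}

// Hypothetical persistent kernel: every resident block loops over the wavefronts of tiles.
__global__ void tiled_bp(volatile int* arrive, volatile int* depart,
                         int num_blocks, int num_wavefronts) {
    for (int w = 0; w < num_wavefronts; ++w) {
        // 1. this block picks the tile(s) of wavefront w assigned to it
        // 2. read boundary messages and pixel data into shared memory
        // 3. run local BP-M inside the tile
        // 4. write the updated boundary messages back to global memory
        wavefront_barrier(arrive, depart, num_blocks, w + 1);  // all blocks finish wavefront w
    }
}
```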
Tiled BP on GPU: Finer Granularity
Looking into a tile:
- One thread per message vector element
- Different groups of threads in a thread block
Tiled BP on GPU: Finer Granularity
Looking into a tile:
- The same for the Up and Down directions
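One way to realize the "one thread per message vector element" mapping is sketched below; the tile side (13) and message-vector length (16) are taken from the shared-memory slides, but the kernel itself is illustrative and not the authors' code.

```cuda
#define TILE   13     // maximum tile side reported on the optimization slides
#define LABELS 16     // assumed message-vector length (13 x 16 = 208 threads per block)

// Illustrative left-to-right sweep inside one tile: threadIdx.x selects the pixel row,
// threadIdx.y the message-vector element, so each thread owns one element of one message.
__global__ void horizontal_sweep(float* tile_messages) {
    int row    = threadIdx.x;
    int label  = threadIdx.y;
    bool active = (row < TILE) && (label < LABELS);

    for (int col = 1; col < TILE; ++col) {            // advance the sweep column by column
        if (active) {
            // tile_messages[...] = updated rightward message at (row, col), element 'label'
        }
        __syncthreads();                              // all threads meet before the next column
    }
    (void)tile_messages; (void)row; (void)label;
}
```

The right-to-left, up, and down sweeps follow the same pattern with the roles of rows and columns exchanged.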
Optimization 1: Shared Memory
- Load data to shared memory at the start of local BP-M for each tile:
  - Boundary messages
  - Data vectors of pixels
- Reserve space in shared memory for the other data used in the computation:
  - Internal message vectors
  - Outgoing boundary messages
Optimization 1: Shared Memory
- All data loading is coalesced
- Row-wise storage of vertical and horizontal boundary messages for memory coalescing
- Tile is at most 13 by 13 so that all data fits in the (48 KB) shared memory
- At most 13 x 16 = 208 threads in a thread block
Optimization 1: Shared Memory
- With the maximum tile size, all of the shared memory is used by one thread block
- Given that, each SM can accommodate just one thread block
- This underutilizes the SMs, but is suitable for inter-block barrier synchronization
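An illustrative way to stage a tile in shared memory, consistent with the bullets above; the region sizes, ordering, and loading scheme are assumptions, not the authors' actual layout.

```cuda
#define TILE   13     // maximum tile side
#define LABELS 16     // assumed message-vector length

// Cooperative, coalesced staging of one tile's inputs into dynamically sized shared memory.
__global__ void load_tile_to_shared(const float* g_boundary, const float* g_data) {
    extern __shared__ float smem[];                             // carved into regions below
    float* boundary_in  = smem;                                 // 4 sides x TILE x LABELS
    float* data_vectors = boundary_in + 4 * TILE * LABELS;      // TILE x TILE x LABELS
    float* workspace    = data_vectors + TILE * TILE * LABELS;  // internal + outgoing messages

    int tid      = threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y;

    // Consecutive threads read consecutive addresses, so the loads coalesce.
    for (int i = tid; i < 4 * TILE * LABELS; i += nthreads)
        boundary_in[i] = g_boundary[i];
    for (int i = tid; i < TILE * TILE * LABELS; i += nthreads)
        data_vectors[i] = g_data[i];
    __syncthreads();                                            // staged data ready for local BP-M
    (void)workspace;
}
```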
Optimization 2: Fast Global Barrier
- State-of-the-art GPU global barrier [1]
- No need to launch multiple kernels, significantly reducing overheads
- Requirements:
  - One thread block per SM
  - Number of thread blocks at kernel launch equal to the number of SMs
  - Manual scheduling of thread blocks onto tiles

[1] S. Xiao and W.-c. Feng, "Inter-block GPU communication via fast barrier synchronization," in IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010.
Other Optimizations
- Fast and parallelized message calculation [1]
- Manual analysis and tuning of the code
- Removing some of the __syncthreads() instructions

[1] C.-C. Cheng, C.-K. Liang, Y.-C. Lai, H. H. Chen, and L.-G. Chen, "Fast belief propagation process element for high-quality stereo estimation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009), IEEE, 2009, pp. 745-748.
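The fast scheme of [1] is not reproduced here; for context, below is only the textbook min-sum message update that such schemes accelerate, parallelized one thread per output label. The truncated-linear smoothness cost and all names are assumptions for illustration.

```cuda
#define LABELS 16     // assumed message-vector length

// Baseline min-sum message update: out[l] = min over l' of
//   data[l'] + (sum of the other incoming messages)[l'] + V(l, l'),
// with V a truncated linear cost. One thread per output label l.
__device__ void compute_message(const float* data,
                                const float* in0, const float* in1, const float* in2,
                                float* out, float lambda, float trunc) {
    int l = threadIdx.y;                              // this thread's output label
    if (l >= LABELS) return;

    float best = 1e30f;
    for (int lp = 0; lp < LABELS; ++lp) {             // minimize over the sender's labels
        float smooth = fminf(lambda * fabsf((float)(l - lp)), trunc);
        float cand   = data[lp] + in0[lp] + in1[lp] + in2[lp] + smooth;
        best = fminf(best, cand);
    }
    out[l] = best;
}
```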
Evaluation

Algorithm | Hardware | Price (USD) | Tsukuba: Exec. Time (ms) / Accuracy [4] | Judging Test: Exec. Time (ms) / Accuracy [4]
BP-M [1] | CPU | $300 | 39,802 / 79.8 | 39,767 / 86.5
TiledBP [2] | CPU | $300 | 1,585.85 / 82.1 | 1,586.75 / 80.9
TiledBP | GPU, NVIDIA GTX 680 [3] | $500 | 9.29 / 82.1 | 9.24 / 80.9
TiledBP | GPU, NVIDIA Tesla C2050 [3] | $1350 | 7.96 / 82.1 | 7.95 / 80.9
Also reported for the GPU rows: GTX 680: 115.0 ms / 83.8; Tesla C2050: 90.7 ms / 83.8

[1] Given reference code on Intel Xeon E5-1620 @ 3.60 GHz
[2] TiledBP CPU implementation on Intel Xeon E5-1620 @ 3.60 GHz
[3] GTX 680 with 8 SMs and Tesla C2050 with 14 SMs
[4] Percentage of accurate depth labels compared to ground truth
Conclusion
- New GPU implementation of TiledBP for stereo matching
  - Wavefront computation
  - Inter-block GPU barrier synchronization
- Evaluation
  - Comparable accuracy
  - Comparable price
  - 200x speedup compared to CPU TiledBP
BACK-UP SLIDES
Optimization 2: Fast Global Barrier
- One thread block per SM
- Number of thread blocks at kernel launch equal to the number of SMs
- Manual scheduling of thread blocks onto tiles
[Code snippet from: S. Xiao and W.-c. Feng, "Inter-block GPU communication via fast barrier synchronization," in IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010]
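The original snippet (an image on the slide) is not reproduced here. Below is a sketch written in the spirit of the Xiao and Feng lock-free inter-block barrier, under the one-thread-block-per-SM assumption above; the flag arrays are assumed to be zero-initialized and a distinct phase value is passed on every call.

```cuda
// Lock-free device-wide barrier sketch (inspired by Xiao & Feng, IPDPS 2010; not their code).
// Safe only if every thread block is resident on an SM, so no block waits to be scheduled.
__device__ void global_barrier(volatile int* arrive, volatile int* depart,
                               int num_blocks, int phase) {
    int bid = blockIdx.x;
    int tid = threadIdx.x;

    __syncthreads();                                  // this block finished its work for the phase
    __threadfence();                                  // make its global writes visible to others
    if (tid == 0) arrive[bid] = phase;                // announce arrival

    if (bid == 0) {
        // Block 0's threads poll the arrival flags of all blocks...
        for (int b = tid; b < num_blocks; b += blockDim.x)
            while (arrive[b] != phase) { /* spin */ }
        __syncthreads();
        // ...then release every block.
        for (int b = tid; b < num_blocks; b += blockDim.x)
            depart[b] = phase;
    }
    if (tid == 0)
        while (depart[bid] != phase) { /* spin */ }   // wait for block 0's release signal
    __syncthreads();                                  // the whole block proceeds together
}
```

Each call uses a fresh phase value (for example, the wavefront index plus one), so the flags never need to be reset between barriers.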