Fast Compressive Sensing MRI Reconstruction using multi-gpu system


1 Fast Compressive Sensing MRI Reconstruction using multi-gpu system. TRAN MINH QUAN, WON-KI JEONG. High-performance Visual Computing Laboratory, School of Electrical and Computer Engineering, Ulsan National Institute of Science and Technology (UNIST), UNIST-gil 50 (100 Banyeon-ri), Eonyang-eup, Ulju-gun, Ulsan Metropolitan City, Republic of Korea

2 Talk Overview: Introduction; 2D Dynamic Compressive Sensing MRI (CS MRI); Split Bregman (SB) Method; Total Variation; Fast 2D Discrete Wavelet Transform (DWT) with mixed-band; Result of single GPU system; Result of multi GPU system; Q&A

3 2D Dynamic MRI (2.5D MRI): Cardiac MRI, Perfusion MRI

4 MRI Reconstruction. Measurement model: $f = Ku$ with $K = RF$, where $R$ is the sampling mask and $F$ is the 2D Fourier transform. Figure panels: Traditional MRI (IFFT2, very fast); Zero Filling Reconstruction (IFFT2); CS MRI (CS, very slow).
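The sampling step in $K = RF$ can be sketched as a per-sample kernel. This is only an illustration, assuming that the full k-space $Fu$ is already on the device as `float2` and that the mask is stored as one byte per sample; the name `apply_mask` and that layout are assumptions, not taken from the slides.

```cuda
#include <cuda_runtime.h>

// Apply the sampling mask R to a k-space buffer: f = R * kspace.
// Unsampled locations are zero-filled. One thread per k-space sample.
// (Hypothetical helper; the mask layout is an assumption.)
__global__ void apply_mask(const float2* kspace, const unsigned char* mask,
                           float2* f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    f[i] = mask[i] ? kspace[i] : make_float2(0.0f, 0.0f);
}
```

An inverse FFT of the masked buffer then gives the zero-filling reconstruction named on the slide.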

5 Motivation. Why do we use sparse sampling? To greatly reduce the scanning time (~16x): from ~20-40 minutes down to ~1-3 minutes. Why do we use the GPUs? To speed up the reconstruction time.

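6 CSMRI Problem

$$\min_u J(u) \quad \text{s.t.} \quad \sum_i \|K u_i - f_i\|_2^2 < \mu$$

Lustig et al.: $J(u) = \|F_z W_{xy} u\|_1$

Goldstein et al.: $J(u) = \|\nabla_{xyz} u\|_1$

Our method: $J(u) = \|W_{xy} u\|_1 + \|\nabla_{xy} u\|_1 + \|\nabla_z u\|_1$

Here $x$ and $y$ are the spatial axes and $z$ is the temporal axis. => $l_1$ minimization problem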

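7 Proposed SB CSMRI Algorithm, for $J(u) = \|\nabla_{xy} u\|_1 + \|\nabla_z u\|_1 + \|W_{xy} u\|_1$

Initialize $u^0 = R F^{-1} f$ and $d_x^0 = d_y^0 = d_z^0 = w^0 = 0$

While $\|u^k - u^{k-1}\|_2 > tol$

Sub Optimization Problem:

$$u^{k+1} = \arg\min_u \frac{\mu}{2}\|Ku - f\|_2^2 + \frac{\lambda}{2}\|d^k - \nabla u - b^k\|_2^2 + \frac{\gamma}{2}\|w^k - Wu - b_w^k\|_2^2$$

Update Bregman Distances (Smoothing/Thresholding):

$$d_x^{k+1} = \max\left(s_{xy}^k - \frac{1}{\lambda}, 0\right)\frac{\nabla_x u^{k+1} + b_x^k}{s_{xy}^k}$$

$$d_y^{k+1} = \max\left(s_{xy}^k - \frac{1}{\lambda}, 0\right)\frac{\nabla_y u^{k+1} + b_y^k}{s_{xy}^k}$$

$$d_z^{k+1} = \max\left(s_z^k - \frac{1}{\lambda}, 0\right)\frac{\nabla_z u^{k+1} + b_z^k}{s_z^k}$$

$$w^{k+1} = \mathrm{shrink}\left(W u^{k+1} + b_w^k, \frac{1}{\gamma}\right)$$

Here $s_{xy}^k = \sqrt{|\nabla_x u^{k+1} + b_x^k|^2 + |\nabla_y u^{k+1} + b_y^k|^2}$ and $s_z^k = |\nabla_z u^{k+1} + b_z^k|$, following the isotropic shrinkage of Goldstein et al.

Update Bregman Variables:

$$b_x^{k+1} = b_x^k + \nabla_x u^{k+1} - d_x^{k+1}$$

$$b_y^{k+1} = b_y^k + \nabla_y u^{k+1} - d_y^{k+1}$$

$$b_z^{k+1} = b_z^k + \nabla_z u^{k+1} - d_z^{k+1}$$

$$b_w^{k+1} = b_w^k + W u^{k+1} - w^{k+1}$$

End

Ref: Goldstein et al., The Split Bregman Method for L1-Regularized Problems, SIAM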
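The wavelet shrinkage and its Bregman update are both element-wise, so they map naturally to one thread per coefficient. The kernel below is a minimal sketch of $w^{k+1} = \mathrm{shrink}(Wu^{k+1} + b_w^k, 1/\gamma)$ followed by $b_w^{k+1} = b_w^k + Wu^{k+1} - w^{k+1}$. It assumes real-valued `float` coefficients for brevity (the deck works on complex `float2` data), and the names `shrink` and `shrink_and_update` are illustrative, not the authors' kernels.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Soft thresholding: shrink(x, t) = sign(x) * max(|x| - t, 0).
__host__ __device__ inline float shrink(float x, float t)
{
    float m = fabsf(x) - t;
    return m > 0.0f ? copysignf(m, x) : 0.0f;
}

// One thread per coefficient (real-valued data is an assumption):
//   w   <- shrink(Wu + bw, 1/gamma)
//   bw  <- bw + Wu - w
__global__ void shrink_and_update(const float* Wu, float* w, float* bw,
                                  float gamma, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float wi = shrink(Wu[i] + bw[i], 1.0f / gamma);
    bw[i] += Wu[i] - wi;
    w[i] = wi;
}
```

Fusing the two updates into one kernel reads `Wu` and `bw` once per element; whether the original implementation fuses them is not stated on the slide.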

8 Building Blocks: Iterative solver; Gradient and Laplacian operators using the Finite Difference Method; Discrete Fourier transform using CUFFT; Discrete Wavelet transform using the fast GPU mixed-band algorithm
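As an illustration of the finite-difference building block, here is a minimal forward-difference kernel along x. The periodic wrap-around at the image border, the real-valued `float` data, and the name `forward_diff_x` are assumptions made for this sketch; the slides do not state the boundary condition.

```cuda
#include <cuda_runtime.h>

// Forward difference along x on a row-major nrows x ncols image:
//   dx[y][x] = u[y][x + 1] - u[y][x]
// The last column wraps around to column 0 (periodic boundary, assumed).
__global__ void forward_diff_x(const float* u, float* dx, int nrows, int ncols)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= ncols || y >= nrows) return;
    int xn = (x + 1 == ncols) ? 0 : x + 1;
    dx[y * ncols + x] = u[y * ncols + xn] - u[y * ncols + x];
}
```

The y and z (temporal) differences follow the same pattern with a different stride.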

9 2D Wavelet Transform: Traditional Approach

10 2D Wavelet Transform: Traditional vs. Mixed-band

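11 2D Wavelet Transform with Mixed-band (1)

Haar 2x2:

$$M = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \quad W = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad G = W M W^T = \frac{1}{2}\begin{bmatrix} a+b+c+d & a-b+c-d \\ a+b-c-d & a-b-c+d \end{bmatrix}$$

Figure panels: Haar 2x2, Haar 4x4, Haar 8x8.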
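The 2x2 Haar step $G = W M W^T$ can be written directly as four signed sums. The helper below is a small sketch with real-valued inputs; the name `haar2x2` and the flat output array are illustrative choices, not part of the deck.

```cuda
#include <cuda_runtime.h>

// One 2x2 Haar step: G = W M W^T with W = (1/sqrt(2)) [[1, 1], [1, -1]]
// and M = [[a, b], [c, d]]. Output is row-major: g = {G00, G01, G10, G11}.
__host__ __device__ inline void haar2x2(float a, float b, float c, float d,
                                        float g[4])
{
    g[0] = 0.5f * (a + b + c + d);  // G[0][0]
    g[1] = 0.5f * (a - b + c - d);  // G[0][1]
    g[2] = 0.5f * (a + b - c - d);  // G[1][0]
    g[3] = 0.5f * (a - b - c + d);  // G[1][1]
}
```

Because $W$ is symmetric and orthogonal, applying `haar2x2` to its own output returns the original block, so the same routine also serves as the 2x2 inverse.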

12 2D Wavelet Transform with Mixed-band (2): encode_8 kernel, decode_8 kernel. Why do we choose block size 8x8?

13 Optimize 2D Haar Wavelet (1)

$$G = \frac{1}{2}\begin{bmatrix} a+b+c+d & a-b+c-d \\ a+b-c-d & a-b-c+d \end{bmatrix}$$

Each thread first loads a, b, c, d of its 2x2 cell from shared memory into registers (broadcasting), then writes the one coefficient that belongs to its position. The four-way branch on the thread position causes divergence.

    __global__ void encode_8(float2* src, float2* dst, int nrows, int ncols, int irows, int icols)
    {
        // Read an 8x8 block from global memory to shared memory ...
        __syncthreads();

        float2 a, b, c, d;  // registers: each thread has its own values

        // First time Haar 2x2 (the guard is always true at this level)
        if (((tid.y & 0) == 0) && ((tid.x & 0) == 0))
        {
            a = smem[(tid.y & (~1)) + 0][(tid.x & (~1)) + 0];
            b = smem[(tid.y & (~1)) + 0][(tid.x & (~1)) + 1];
            c = smem[(tid.y & (~1)) + 1][(tid.x & (~1)) + 0];
            d = smem[(tid.y & (~1)) + 1][(tid.x & (~1)) + 1];
        }
        __syncthreads();

        // Haar 2x2: select the coefficient by the low bit of tid.y and tid.x (divergent)
        if (((tid.y & 1) == 0) && ((tid.x & 1) == 0))
            smem[tid.y][tid.x] = 0.5f * (a + b + c + d);
        else if (((tid.y & 1) == 0) && ((tid.x & 1) == 1))
            smem[tid.y][tid.x] = 0.5f * (a - b + c - d);
        else if (((tid.y & 1) == 1) && ((tid.x & 1) == 0))
            smem[tid.y][tid.x] = 0.5f * (a + b - c - d);
        else
            smem[tid.y][tid.x] = 0.5f * (a - b - c + d);
        __syncthreads();

        // Second time Haar 2x2 ...
        // Third time Haar 2x2 ...
    }

14 Optimize 2D Haar Wavelet (2)

$$G = \frac{1}{2}\begin{bmatrix} a+b+c+d & a-b+c-d \\ a+b-c-d & a-b-c+d \end{bmatrix}$$

With $M$ flattened to $(a, b, c, d)^T$, the sign patterns of $G_{00}, G_{01}, G_{10}, G_{11}$ are the rows of $H_2$.

Recursive Representation:

$$H_d = \frac{1}{\sqrt{2}}\begin{bmatrix} +H_{d-1} & +H_{d-1} \\ +H_{d-1} & -H_{d-1} \end{bmatrix}$$

Synthetic Representation:

$$H^d_{i,j} = \frac{1}{2^{d/2}}(-1)^{i \cdot j}$$

where $i \cdot j$ is the bitwise dot product, e.g. $3 \cdot 2 = (1,1) \cdot (1,0) = 1$.

    // Assumes float2 arithmetic operators are available, as in the original code.
    __device__ void switchsign(unsigned int intsign, float2* number)
    {
        *number *= intsign ? (-1.0f) : (1.0f);
    }

    __global__ void encode_8(float2* src, float2* dst, int nrows, int ncols, int irows, int icols)
    {
        // Read an 8x8 block from global memory to shared memory ...
        __syncthreads();

        float2 a, b, c, d;  // registers: each thread has its own values

        // First time Haar 2x2
        if (((tid.y & 0) == 0) && ((tid.x & 0) == 0))
        {
            a = smem[(tid.y & (~1)) + 0][(tid.x & (~1)) + 0];
            b = smem[(tid.y & (~1)) + 0][(tid.x & (~1)) + 1];
            c = smem[(tid.y & (~1)) + 1][(tid.x & (~1)) + 0];
            d = smem[(tid.y & (~1)) + 1][(tid.x & (~1)) + 1];
        }
        __syncthreads();  // all loads must finish before smem is overwritten

        // Sign of each term from the thread's low bits; no divergent branch
        switchsign((((tid.y >> 0) & 1) & 0) ^ (((tid.x >> 0) & 1) & 0), &a);
        switchsign((((tid.y >> 0) & 1) & 0) ^ (((tid.x >> 0) & 1) & 1), &b);
        switchsign((((tid.y >> 0) & 1) & 1) ^ (((tid.x >> 0) & 1) & 0), &c);
        switchsign((((tid.y >> 0) & 1) & 1) ^ (((tid.x >> 0) & 1) & 1), &d);
        smem[tid.y][tid.x] = 0.5f * (a + b + c + d);
        __syncthreads();

        // Second time Haar 2x2 ...
        // Third time Haar 2x2 ...
    }
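Since only the parity of the bitwise dot product matters in $(-1)^{i \cdot j}$, the sign of any entry of $H_d$ can be computed without a lookup table. The helper below is a sketch of that idea; `hadamard_sign` is a hypothetical name, and the XOR-folding is one possible way to take the parity that also compiles for the host. In device-only code, `__popc(i & j) & 1` gives the same result.

```cuda
#include <cuda_runtime.h>

// Sign bit of H_{i,j}: returns 1 (negative entry) when the bitwise dot
// product of i and j is odd, and 0 (positive entry) otherwise.
__host__ __device__ inline unsigned int hadamard_sign(unsigned int i,
                                                      unsigned int j)
{
    unsigned int v = i & j;  // bits that are set in both indices
    // Fold the 32 bits down to their parity with XORs.
    v ^= v >> 16;
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}
```

For the slide's example, `hadamard_sign(3, 2)` is 1 because $(1,1) \cdot (1,0) = 1$ is odd, while `hadamard_sign(3, 3)` is 0 because the dot product 2 is even.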

15 Optimize 2D Haar Wavelet (3)

A complex number y = m + i*n is stored as a float2 (float2.x = m, float2.y = n). Two ways to switch its sign:

Multiply with -1 or +1:

    __device__ void switchsign(unsigned int intsign, float2* number)
    {
        *number *= intsign ? (-1.0f) : (1.0f);
    }

Casting signed bit (the two hex masks are truncated in the source):

    __device__ void switchsign(unsigned int intsign, float2* number)
    {
        *((long long int*)number) ^= intsign ? 0x : 0x ;
    }

Table: 2D WAVELET TRANSFORM WITH MIXED-BAND, image size 512 x 512, unit: milliseconds. Columns: Shared Memory, Multiply with -1 or +1, Casting signed bit, each at 1 lvl, 3 lvls and 9 lvls. Rows: 2D Forward Wavelet, 2D Inverse Wavelet. The timing values are not preserved; two entries per row are N/a.
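The sign-bit variant can be sketched as follows. The mask `0x80000000` (the IEEE-754 single-precision sign bit) is an assumption, since the constants on the slide are not legible, and `flip_sign_bit` / `switchsign_bits` are hypothetical names. The sketch uses `memcpy` instead of a pointer cast so that it is also well defined when compiled for the host.

```cuda
#include <cuda_runtime.h>
#include <string.h>

// Flip the sign of one float by XOR-ing its IEEE-754 sign bit.
// The mask value is an assumption (not legible in the source).
__host__ __device__ inline float flip_sign_bit(float v, unsigned int flip)
{
    unsigned int bits;
    memcpy(&bits, &v, sizeof(bits));
    bits ^= flip ? 0x80000000u : 0u;
    memcpy(&v, &bits, sizeof(v));
    return v;
}

// Negate both components of a complex value when intsign is non-zero.
__host__ __device__ inline void switchsign_bits(unsigned int intsign,
                                                float2* number)
{
    number->x = flip_sign_bit(number->x, intsign);
    number->y = flip_sign_bit(number->y, intsign);
}
```

The XOR replaces two floating-point multiplications with two integer operations, which is the trade-off the slide's "Casting signed bit" column measures.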

16 Comparison on Lenna (512x512), full decomposition (9 levels), in milliseconds: Filter Scheme (GPU); Lifting Scheme (GPU); Mixed-band (GPU) (20x faster)

17 Put everything together with 1 GPU

$$\min_u J(u) \quad \text{s.t.} \quad \sum_i \|K u_i - f_i\|_2^2 < \mu$$

$$J(u) = \|\nabla_{xy} u\|_1 + \|\nabla_z u\|_1 + \|W_{xy} u\|_1$$

18 Results of 2D Dynamic MRI. Flank Tumor Dataset (256 slices), image size: 128x128. Figure panels: 1/1, 1/4, 1/8, 1/10, 1/12, 1/16

19 Performance of the CSMRI reconstruction (in milliseconds), image size: 128x128. Columns: 32 slices, 128 slices, 256 slices. Rows: Inverse Differentiation; Compute Right Hand Side; Sub Optimization Problem (Modified Richardson); Forward Differentiation; Shrinkage; Update Bregman Parameter; Update Kspace; Total. The timing values are not preserved.

20 MultiGPU system information. Output of `lspci -tv`: Advanced Micro Devices [AMD] nee ATI RD890 Northbridge only dual slot (2x8) PCI-e GFX Hydra part bridges, each connecting several NVIDIA Corporation Tesla M2090 devices. MPI

21 MultiGPU Implementation (1). OpenMP. $J(u) = \|\nabla_{xy} u\|_1 + \|\nabla_z u\|_1 + \|W_{xy} u\|_1$. cudaMemcpy peer-to-peer: ~6.1 GB/s. Ref: Paulius M., Implementing 3D Finite Difference code on GPUs, GTC

22 MultiGPU Implementation (2)

23 MultiGPU Implementation (3)
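A peer-to-peer transfer between two GPUs can be sketched with the CUDA runtime's `cudaMemcpyPeer`. The helper below assumes that the data is split across GPUs along the temporal axis, so that $\nabla_z$ needs boundary slices from the neighboring GPU; that decomposition and the name `copy_halo` are assumptions for illustration, not details confirmed by the slides.

```cuda
#include <cuda_runtime.h>

// Copy `count` floats from `src` on device srcDev to `dst` on device dstDev.
// Direct peer access is enabled when the hardware supports it.
// Side effect: may change the calling thread's current device to dstDev.
inline cudaError_t copy_halo(float* dst, int dstDev,
                             const float* src, int srcDev, size_t count)
{
    if (srcDev != dstDev) {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, dstDev, srcDev);
        if (canAccess) {
            cudaSetDevice(dstDev);
            cudaError_t e = cudaDeviceEnablePeerAccess(srcDev, 0);
            if (e == cudaErrorPeerAccessAlreadyEnabled)
                cudaGetLastError();  // already enabled: clear the error state
            else if (e != cudaSuccess)
                return e;
        }
    }
    return cudaMemcpyPeer(dst, dstDev, src, srcDev, count * sizeof(float));
}
```

With OpenMP, each host thread would typically own one GPU and call such a helper for its boundary slices once per iteration.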

24 Performance on multiple GPUs. Columns: 1 GPU, 2 GPUs (with Sync), 4 GPUs (with Sync), and one further multi-GPU configuration (with Sync). Rows: Inverse Differentiation; Compute Right Hand Side; Sub Optimization Problem; Forward Differentiation; Shrinkage; Update Bregman Parameter; Update Kspace; Data Transfer; Total. The timing values are not preserved.

25 Scalability. Plot of time (milliseconds) against number of GPUs; speedup labels: 1.7x, 2.9x, 4.3x

26 Conclusion. Summary: Split Bregman formulation for dynamic CSMRI; 2D DWT on the GPU using the mixed-band algorithm; multi-GPU implementation using P2P communication. Acknowledgement: thanks to HyungJoon Cho and SoHyun Han for data and discussion. Funding from NRF Grant # 2012R1A1A

27 Thank you
