High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli
Motivation
- The Fourier transform is widely used in physics, astronomy, engineering, etc. Applications include signal processing and fluid dynamics.
- The original 1D DFT algorithm is O(N^2); the FFT improves the time complexity to O(N log N).
- Nevertheless, the FFT is still computationally intensive, and continuing advances in applications demand larger and faster implementations of the algorithm.
2D Fourier Transform equation
The two-dimensional Fourier transform of an M x N signal f(m, n) is given by

  F(u, v) = sum_{m=0}^{M-1} sum_{n=0}^{N-1} f(m, n) e^{-2 pi i (um/M + vn/N)}

It can be modified to:

  F(u, v) = sum_{m=0}^{M-1} e^{-2 pi i um/M} [ sum_{n=0}^{N-1} f(m, n) e^{-2 pi i vn/N} ]

The term in square brackets corresponds to the one-dimensional Fourier transform of the m-th line and can be computed using the standard fast Fourier transform (FFT). Each line is substituted with its Fourier transform, and the one-dimensional discrete Fourier transform of each column is computed.
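This row-then-column decomposition can be checked with a small NumPy sketch (illustrative only; the actual implementation uses CUFFT/MKL):

```python
import numpy as np

# Verify that the 2D DFT decomposes into 1D FFTs along the rows (lines)
# followed by 1D FFTs along the columns, as derived above.
rng = np.random.default_rng(0)
f = rng.standard_normal((8, 16)) + 1j * rng.standard_normal((8, 16))

row_fft = np.fft.fft(f, axis=1)          # 1D FFT of each line
two_step = np.fft.fft(row_fft, axis=0)   # 1D FFT of each column

assert np.allclose(two_step, np.fft.fft2(f))
```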
Our AIM and Previous Work
Our aim: to develop an efficient heterogeneous CPU-GPU implementation of the 2-dimensional FFT.
There have been attempts at heterogeneous 2D FFT, namely: An Efficient, Model-Based CPU-GPU Heterogeneous FFT Library, IPDPS 2008.
Summary of the paper:
- The library achieves optimal performance using heterogeneous CPU-GPU computing resources; the load distribution ratio is automatically predicted from a performance model.
- The 2D FFT is computed using 1D FFT libraries such as CUFFT, MKL, and FFTW.
Results: the heterogeneous library is around two times slower than the best existing GPU FFT libraries, because of the overhead of transposing matrices and multiple data transfers between CPU and GPU.
Shortcomings and Strengths
Strengths:
- The load-balancing ratio is automatically estimated from a performance model, and the prediction error is quite low; hence, less human burden.
- The library can handle data of very large sizes, which is not the case with pure GPU libraries due to memory limitations on the GPU.
- Faster implementations of the FFT are possible only through heterogeneous architectures.
Shortcomings:
- Only 1 thread is used on the dual core; the other core is occupied by the GPU control thread.
- Transposing of large matrices is done on the CPU, during which the GPU remains idle.
- No work is done during data transmission.
- The hardware and software resources available today are much more advanced than those used in the paper (Nvidia GeForce 8800 GTX with 128 cores and Intel Core 2 Duo E6400, 2.13 GHz).
- One would prefer pure GPU FFT libraries due to their better performance.
ISCA 2010 GPU-CPU myth paper
The paper Debunking the 100X GPU vs. CPU Myth, ISCA 2010, highlights the architectural features of GPUs and CPUs that contribute to the performance gap between the two. After optimizing 14 kernels on both GPU and CPU, the performance gap between an Nvidia GTX 280 and an Intel Core i7-960 narrows to only 2.5x on average (for FFT, the ratio is 3x). The paper reports CUFFT as the best 1D FFT implementation on GPUs and MKL as the best on multi-core CPUs. We have seen that the 2D FFT is composed of 1D FFTs along rows, then 1D FFTs along columns, so we can use the above libraries in our hybrid algorithm.
Pure GPU and Pure CPU timings

FFT Size      CUFFT time (ms), Tesla GPU   MKL time (ms), Core i7, 12 threads
1000 x 1000   1.45                         6.00
2048 x 2048   6.75                         19.00
8000 x 8000   87.5                         400.00
8192 x 8192   173                          400.00
Experimental Setup
We use a machine having an Nvidia Tesla processor in combination with an Intel Core i7-980. The Tesla has 30 streaming multiprocessors, each with 8 CUDA cores, giving a total of 240 cores. The Core i7 can run 12 threads at a time. We use the Linux kernel with the Nvidia display driver and CUDA version 4.0. Each element of the 2D matrices used in the following experiments consists of two 32-bit floating-point numbers, representing the real and imaginary parts respectively.
Approach
The basic approach is as follows:
o The first transpose can be shifted to either side (GPU / CPU), whichever is more effective.
o The second transpose is a crucial step; we can exploit the pipelining pattern present in the system.
o All data transfers in the algorithm can be overlapped with the transpose kernel using CUDA streams.
o If all data transfer time is hidden, the effective time will be (FFT computation time + transpose computation time).
CUDA streams
Streams help in achieving concurrent kernel / memcpy execution. When no stream is specified, operations go to the default stream (0) and are not concurrent. With two streams, approximately all of the data transfer time is hidden by overlapping it with kernel execution:
Stream 1: Transfer Data A -> Kernel on Data A
Stream 2: Transfer Data B -> Kernel on Data B
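The double-buffering idea behind this overlap can be sketched in Python (a hypothetical stand-in only: `transfer` and `kernel` play the roles of cudaMemcpyAsync and a CUDA kernel launch, and a thread pool plays the role of the GPU's copy queue):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def transfer(chunk):   # stands in for an async host-to-device copy
    return chunk.copy()

def kernel(chunk):     # stands in for a CUDA kernel launch
    return chunk * 2.0

data = [np.full(4, i, dtype=np.float32) for i in range(4)]
results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    pending = pool.submit(transfer, data[0])
    for nxt in data[1:]:
        on_device = pending.result()
        pending = pool.submit(transfer, nxt)   # overlap next transfer...
        results.append(kernel(on_device))      # ...with this "kernel"
    results.append(kernel(pending.result()))

assert all(np.allclose(r, d * 2.0) for r, d in zip(results, data))
```

While chunk A is being processed, chunk B is already in flight, so only the first transfer sits on the critical path.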
Hybrid Algorithm
1. Distribute and transfer data: Divide the 2D signal into GPU_rows and CPU_rows. Transfer the signal to the GPU.
2. Transpose (GPU): Transpose the GPU rows and the CPU rows; transfer the CPU rows to the host.
3. Row-FFT: Use CUFFT for the row FFT on GPU rows; use MKL for the row FFT on CPU rows.
4. Transpose (GPU): Transpose the GPU rows; transfer the CPU rows to the device and transpose them.
5. Column-FFT: Transfer the transposed CPU rows; use CUFFT for the column FFT on GPU rows; use MKL for the column FFT on CPU rows.
6. Final output transfer: Transfer the output that is currently on the GPU back to the host.
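The row split at the heart of the hybrid algorithm can be sketched with NumPy standing in for both CUFFT and MKL (the split point 12 is an arbitrary illustration, not the model-predicted ratio):

```python
import numpy as np

rng = np.random.default_rng(1)
signal = rng.standard_normal((16, 16)) + 0j

split = 12                                # e.g. 75% of rows to the "GPU"
gpu_rows, cpu_rows = signal[:split], signal[split:]

# Row-FFT phase: each device transforms its own rows.
after_rows = np.vstack([np.fft.fft(gpu_rows, axis=1),
                        np.fft.fft(cpu_rows, axis=1)])

# Transpose, then the same row-wise FFT now acts on the columns.
t = after_rows.T
result = np.vstack([np.fft.fft(t[:split], axis=1),
                    np.fft.fft(t[split:], axis=1)]).T

assert np.allclose(result, np.fft.fft2(signal))
```

The two devices never need to exchange data within a phase; only the transposes and transfers between phases couple them.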
Timings (Threshold = 90%)

FFT Size      Pure GPU time   Hybrid time   Transpose time   FFT computation time   Data transfer time (CPU rows)
1024 x 1024   1146            3150          1700             700                    750
2048 x 2048   6000            12000         6800             2800                   2200
4096 x 4096   28198           52615         30554            14000                  8230
8192 x 8192   156180          226461        122614           71451                  32396
(all times in microseconds)

More than half of the time is spent in transpose computations. How can the transpose time be hidden? Perhaps by using the CPU for transposing via pipelining: FFT -> data transfer -> transpose (CPU).
Using CPU for Transposing: Problems
- By the time the GPU finishes the row/column FFT on the entire signal, only 10% of the data can be transferred to the CPU end for transposing.
- The CPU transpose time is much higher than that of the corresponding CUDA kernel.
- The transposed data must be transferred back to the GPU end for the FFT: one more pipeline!
Hiding Transpose time
If the transpose time is to be hidden:
- Much higher bandwidth would be required than what is currently available.
- We would need a device as powerful as the GPU. A multi-GPU system?
Review
The previous algorithm was as follows:
Total time taken = 52000 microseconds (4096 x 4096)
CUFFT benchmark timing = 28000 microseconds
Hiding Transpose
The algorithm can be seen as a series of row-FFT -> transpose -> row-FFT -> transpose, where the transpose is the bottleneck. The transpose can be hidden with the following pipeline: if its time is hidden, the effective time of the algorithm will be twice the row-FFT time. Deciding which device to assign the row-FFT and the transpose to is not trivial.
Steps to implement the pipeline
1. Divide the signal into chunks of rows.
2. Perform the FFT of one chunk at the CPU side and, after finishing, start transferring it to the GPU.
3. The CPU starts the row FFT of another chunk while the GPU transposes the previous chunk.
4. Repeat until all chunks are transferred and transposed at the GPU end.
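The steps above can be sketched sequentially in NumPy (the overlap itself is omitted; NumPy stands in for both MKL's row FFT and the GPU transpose kernel, and the chunk size of 4 is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
signal = rng.standard_normal((16, 16))
chunk = 4                                    # rows per chunk

out = np.empty((16, 16), dtype=complex)      # transposed row-FFT result
for start in range(0, signal.shape[0], chunk):
    rows = signal[start:start + chunk]
    fft_rows = np.fft.fft(rows, axis=1)      # step 2: CPU row FFT
    out[:, start:start + chunk] = fft_rows.T # steps 3-4: "GPU" transpose

assert np.allclose(out, np.fft.fft(signal, axis=1).T)
```

After the last chunk, the GPU holds the transposed row-FFT of the whole signal, ready for the second row-FFT pass.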
Implementation using 2 streams
Operations per chunk

Chunk size   CPU row-FFT per chunk   Transfer time per chunk
10           280                     275
20           377                     530
50           620                     1200
100          1000                    2200

The time required to transfer one chunk from CPU to GPU exceeds the time required for its FFT on the CPU. The two are equal only at small chunk sizes, which is of no use: the effective time of the pipeline will be dominated by the transfer time.
Result and Observations
- As expected, the total time of the pipeline is dominated by the one-way data transfer time.
- The GPU is under-utilized.
- The entire signal is transferred over time, which is very inefficient.
- We need a data decomposition that assigns FFT computations to the GPU as well.
Eliminate Transpose
Can we eliminate the explicit transpose computation? Doing the row-FFT and the transpose in a single step is possible: read the data in row-major form and write the FFT output in column-major form. The timing of this operation should stay close to the original row-FFT timing, so that the total time = 2 * (row-FFT time).
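A minimal NumPy sketch of this fused "row-FFT + transposed write" (the Fortran-ordered array stands in for a column-major output buffer):

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.standard_normal((8, 8))

out = np.asfortranarray(np.empty((8, 8), dtype=complex))
for i in range(a.shape[0]):
    out[:, i] = np.fft.fft(a[i])     # row i is written as column i

# No separate transpose pass was needed:
assert np.allclose(out, np.fft.fft(a, axis=1).T)
```

Applying the same operation a second time yields the full 2D FFT.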
Using Stride parameters
CUFFT allows specifying STRIDE and DIST parameters for the input and output data. For instance, consider a simple Row-Row formulation: here, input stride = output stride = 1, and input dist = output dist = Y.
Another Example
For an FFT along the columns: here, input stride = output stride = Y, and input dist = output dist = 1.
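The column-FFT layout above can be illustrated with NumPy slicing on the flattened row-major buffer (an analogy for how CUFFT would walk memory; Y is the row length):

```python
import numpy as np

Y = 8
a = np.random.default_rng(4).standard_normal((4, Y))
flat = a.ravel()            # the matrix as a flat row-major buffer

col0 = flat[0::Y]           # stride = Y walks down column 0
col1 = flat[1::Y]           # dist = 1 moves the start to column 1

assert np.allclose(np.fft.fft(col0), np.fft.fft(a, axis=0)[:, 0])
assert np.allclose(np.fft.fft(col1), np.fft.fft(a, axis=0)[:, 1])
```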
4 Types of Operations
- Row-Row: read in row-major, write in row-major
- Row-Column: read in row-major, write in column-major
- Column-Row: read in column-major, write in row-major
- Column-Column: read in column-major, write in column-major
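Three of these operations can be sketched in NumPy, and their compositions checked against the full 2D FFT (illustrative stand-ins, not the CUFFT plans themselves):

```python
import numpy as np

def row_row(x):        # read row-major, write row-major
    return np.fft.fft(x, axis=1)

def row_column(x):     # read row-major, write column-major
    return np.fft.fft(x, axis=1).T

def column_column(x):  # read column-major, write column-major
    return np.fft.fft(x, axis=0)

a = np.random.default_rng(5).standard_normal((8, 8))
# Both compositions measured in the following slides yield the 2D FFT:
assert np.allclose(column_column(row_row(a)), np.fft.fft2(a))
assert np.allclose(row_column(row_column(a)), np.fft.fft2(a))
```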
Timings

Operation       1024 x 1024   2048 x 2048
Row-Row         140           470
Row-Column      245           1550
Column-Row      265           2770
Column-Column   273.5         3138
(times in usec)

Due to reduced spatial locality, operations other than Row-Row take more time.
If 2D FFT = Row-Row + Column-Column, the time taken is 140 + 273.5 = 413.5 usec (1024 x 1024) and 470 + 3138 = 3608 usec (2048 x 2048). The benchmark timings are 380 usec (1024 x 1024) and 1480 usec (2048 x 2048).
[Chart: Row-Row followed by Column-Column — time (usec) vs. 2D FFT size (512, 1024, 2048), Benchmark vs. Strided FFT]
[Chart: Row-Column followed by Row-Column — time (usec) vs. 2D FFT size (512, 1024, 2048), Benchmark vs. Strided FFT]
[Chart: Column-Row followed by Column-Row — time (usec) vs. 2D FFT size (512, 1024, 2048), Benchmark vs. Strided FFT]
Can Hybrid Help?
- The FFT implementations using strides are 2x slower than the benchmark.
- Using the CPU in parallel can improve the timings by 10-20%, which may not be useful enough.
- Rows/columns can be distributed between the CPU and GPU; possibly, the division threshold can be estimated from the available bandwidth.