Debunking the 100X GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU

1 Debunking the 100X GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU

2 The myth
10x-1000x speedup on GPU vs CPU
Papers supporting the myth:
- Microsoft: N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli. High performance discrete Fourier transforms on graphics processors (2008)
- NVIDIA: NVIDIA CUDA Zone (2009)
- Others:
  - C. Bennemann, M. Beinker, D. Egloff, and M. Gauckler. Teraflops for games and derivatives pricing
  - L. Genovese. Graphic processing units: A possible answer to HPC (2009)
  - M. Silberstein, A. Schuster, D. Geiger, A. Patney, and J. D. Owens. Efficient computation of sum-products on GPUs through software-managed cache (2008)
  - J. Tolke and M. Krafczyk. TeraFLOP computing on a desktop PC with GPUs for 3D CFD (2008)
  - F. Vazquez, E. M. Garzon, J. A. Martinez, and J. J. Fernandez. The sparse matrix vector product on GPUs (2009)
  - Z. Yang, Y. Zhu, and Y. Pu. Parallel Image Processing Based on CUDA (2008)

3 CPU vs GPU
- Processing element
  - CPU designed for general computing
  - GPU designed for throughput
- Cache size vs multi-threading
  - CPU has large multilevel caches to reduce memory access delay
  - GPU uses multi-threading to mitigate memory access delay
- Bandwidth
  - CPU (Core i7 960) has a memory bandwidth of 32 GB/s
  - GPU (GTX 280) has a memory bandwidth of 141 GB/s
- Other elements
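The bandwidth figures on the slide above imply a simple ratio, which the short Python sketch below computes. It is only a first-order guide for kernels limited purely by streaming memory traffic; measured speedups can differ because neither processor necessarily sustains its quoted bandwidth. The function name `bandwidth_ratio` is made up for this illustration.

```python
# Memory bandwidth figures quoted on the slide, in GB/s.
CPU_BANDWIDTH_GBS = 32.0   # Core i7 960
GPU_BANDWIDTH_GBS = 141.0  # GTX 280


def bandwidth_ratio(gpu_gbs: float, cpu_gbs: float) -> float:
    """Return how many times more memory bandwidth the GPU has (hypothetical helper)."""
    return gpu_gbs / cpu_gbs


ratio = bandwidth_ratio(GPU_BANDWIDTH_GBS, CPU_BANDWIDTH_GBS)
print(f"GPU/CPU bandwidth ratio: {ratio:.1f}x")
```

A ratio of this size is far below the 100x claims the deck sets out to examine.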

4 Method
- Ran 14 different kernels on CPU and GPU
  - Kernels from many different applications
  - Code was optimized for both CPU and GPU individually
- High performance CPU and GPU
  - Intel Core i7-960 (3.2 GHz and 4 cores)
  - EVGA GeForce GTX 280

5 Results
- GPU was on average 2.5x faster than the CPU
- Two kernels were faster on the CPU
- Only one kernel was more than 6x faster (14.9x), due to its use of the texture sampler

6 Supporting papers
- Intel:
  - C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. Nguyen, T. Kaldewey, V. Lee, S. Brandt, and P. Dubey. FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs (2010)
  - N. Satish, C. Kim, J. Chhugani, A. Nguyen, V. Lee, D. Kim, and P. Dubey. Fast Sort on CPUs and GPUs: A Case For Bandwidth Oblivious SIMD Sort (2010)
  - M. Smelyanskiy, D. Holmes, J. Chhugani, A. Larson, D. Carmean, D. Hanson, P. Dubey, K. Augustine, D. Kim, A. Kyker, V. W. Lee, A. D. Nguyen, L. Seiler, and R. A. Robb. Mapping high-fidelity volume rendering for medical imaging to CPU, GPU and many-core architectures (2009)
- NVIDIA:
  - N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs (2009)
- Others:
  - V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra (2008)

7 Analysis
- Bandwidth-bound kernels benefit from the GPU's larger bandwidth
  - Two bandwidth-bound kernels were 5x faster on the GPU
  - One bandwidth-bound kernel was only 2x faster due to a larger cache on the CPU
- Compute-bound kernels benefit from the GPU's higher FLOPS
  - Three compute-bound kernels were 3x-4x faster on the GPU
  - One compute-bound kernel was only 2x faster because it requires double-precision arithmetic
  - One compute-bound kernel was 6x faster due to fast transcendental operations on the GPU
  - One compute-bound kernel was faster on the CPU due to its better buffer management and data scatter handling
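One common way to reason about the bandwidth-bound versus compute-bound split used on the slide above is the roofline model. The slides do not mention it, so it is brought in here purely as an illustration. The Python sketch below assumes a kernel is characterized by its arithmetic intensity (floating-point operations per byte of memory traffic). The 32 GB/s bandwidth is the CPU figure from slide 3, while the peak compute value `PLACEHOLDER_PEAK_GFLOPS` is a made-up placeholder, not a figure from the deck.

```python
def attainable_gflops(intensity: float, bandwidth_gbs: float, peak_gflops: float) -> float:
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def bound_type(intensity: float, bandwidth_gbs: float, peak_gflops: float) -> str:
    """Classify a kernel as 'bandwidth' bound or 'compute' bound under the roofline model."""
    return "bandwidth" if intensity * bandwidth_gbs < peak_gflops else "compute"


CPU_BANDWIDTH_GBS = 32.0          # from slide 3 (Core i7 960)
PLACEHOLDER_PEAK_GFLOPS = 100.0   # assumption for illustration only, not from the slides

# Hypothetical kernels: a streaming kernel with few operations per byte,
# and a dense kernel that reuses each byte many times.
for name, intensity in [("streaming", 0.25), ("dense", 10.0)]:
    kind = bound_type(intensity, CPU_BANDWIDTH_GBS, PLACEHOLDER_PEAK_GFLOPS)
    cap = attainable_gflops(intensity, CPU_BANDWIDTH_GBS, PLACEHOLDER_PEAK_GFLOPS)
    print(f"{name}: {kind}-bound, at most {cap:.1f} GFLOP/s")
```

Under this model, a bandwidth-bound kernel gains from extra memory bandwidth and a compute-bound kernel gains from extra FLOPS, which matches the grouping the slide uses.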

8 Analysis
- A larger cache reduces the need for bandwidth, helping bandwidth-bound kernels
  - One kernel was twice as fast when the working set fit in the cache
  - Several kernels have working sets that scale with the number of threads, increasing the risk of cache misses on the GPU
- Gather/scatter has HW acceleration on the GPU
  - One kernel is highly dependent on gather operations, resulting in a 15x performance improvement
  - One kernel dependent on widely spread scatter operations was only 1.6x faster on the GPU due to memory bandwidth limitations
- Reduction and synchronization
  - The CPU has HW synchronization that improves the performance of one kernel (2x), while another was faster on the GPU (1.8x)
- Fixed-function units improve the performance of kernels that can make use of them
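For readers unfamiliar with the gather/scatter terminology on the slide above, the Python sketch below shows the two access patterns in their simplest scalar form. It only illustrates what the operations mean, not how either processor implements them. The function names `gather` and `scatter` are chosen for this illustration.

```python
def gather(src, indices):
    """Read elements from scattered positions of src into a dense list."""
    return [src[i] for i in indices]


def scatter(dst, indices, values):
    """Write a dense list of values to scattered positions of dst (in place)."""
    for i, v in zip(indices, values):
        dst[i] = v
    return dst


data = [10, 20, 30, 40]
picked = gather(data, [3, 0, 2])            # reads data[3], data[0], data[2]
placed = scatter([0] * 4, [3, 0, 2], [1, 2, 3])
print(picked, placed)
```

When the indices are far apart, each access can touch a different region of memory, which is why the slide ties the widely spread scatter case to memory bandwidth limitations.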

9 Explanation of the myth
- Some papers compare a high-performance GPU to a mobile CPU
- Many studies compare unoptimized CPU code to optimized GPU code

10 Recommendations
- High compute FLOPS and memory bandwidth
- Large cache
- Gather/scatter
- Efficient synchronization
- Fixed-function units

11 That's all, folks!
