Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi

1 Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
Erik Saule (1), Kamer Kaya (1) and Ümit V. Çatalyürek (1,2)
esaule@uncc.edu, {kamer,umit}@bmi.osu.edu
(1) Department of Biomedical Informatics
(2) Department of Electrical and Computer Engineering
The Ohio State University
PPAM 2013, Monday Sept 9
Erik Saule (OSU) Using Xeon Phi on Sparse Matrices PPAM / 2

2 Outline
1. The Intel MIC Architecture
2. SpMV
3. SpMM
4. Conclusion

3 What is Intel MIC?
Intel Many Integrated Core (MIC) is Intel's response to GPUs becoming popular in High Performance Computing.
What do GPUs do well?
- Get a lot of GFlop/s by using hundreds of cores
- Each core has large SIMD-like abilities
- Hide memory latency by using 1-cycle context switches
What do GPUs not do well?
- Alien to program
- Poor support for legacy applications
- Inter-thread communication
- Branching
Goal of Intel MIC: do all of it well!

4 Overall Architecture
[Figure legend: Core, Memory Controller, PCI-e Controller]
- 8 memory controllers: GDDR5, 2 channels (32-bit) each, 5.5GT/s
- 352GB/s aggregated peak: twice the GPU's, but you typically get half
- Ring bus at 22GB/s
- 5+ cores: 32KB of L1 cache, 512KB of L2 cache, LRU, 8-way associative
- 1 PCI-e controller to the host (2GB/s guaranteed to memory)
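The 352GB/s aggregated peak is consistent with multiplying out the memory figures on this slide. A minimal sketch of that arithmetic, assuming each of the 8 controllers drives 2 channels of 32 bits at 5.5GT/s; the function name is illustrative and not taken from the talk.

```python
def aggregate_bandwidth_gbs(controllers, channels_per_controller,
                            channel_bits, gigatransfers_per_s):
    """Peak bandwidth in GB/s: every channel moves channel_bits per transfer."""
    bytes_per_transfer = channel_bits // 8
    return (controllers * channels_per_controller
            * bytes_per_transfer * gigatransfers_per_s)


# 8 controllers x 2 channels x 4 bytes x 5.5 GT/s
peak = aggregate_bandwidth_gbs(8, 2, 32, 5.5)
print(peak)
```

This is a theoretical ceiling only; the slide itself notes that you typically get about half of it.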

5 Core Architecture
[Figure: core diagram. Source: Intel]
- 64-bit
- 4 hardware threads: no context switching, consecutive instructions come from different threads
- A vectorial unit with 512-bit wide registers: sqrt, rsqrt, log, exp; mul, div, add, sub, fma; permutation; swizzling; masking
- Two instruction pipes: 2 ALU ops, ALU + MEM ops, ALU + VFP ops, or VFP + MEM ops
- In-order execution

6 When Should I Use Intel MIC?
Key points of SE1P:
- Large memory bandwidth (peak: 22GB/s)
- 61 cores with mandatory use of hardware threading
- 512-bit wide SIMD registers:
  FMA: up to 2x16 SP Flop/c (2x8 DP Flop/c)
  otherwise: up to 16 SP Flop/c (8 DP Flop/c)
On a 61-core configuration at 1.5GHz:
- FMA: 2x16x61x1.5GHz = 2.48 TFlop/s SP (1.24 TFlop/s DP)
- otherwise: 16x61x1.5GHz = 1.24 GFlop/s SP (512 GFlop/s DP)
Lots of bandwidth? Fused Multiply Add? Large vector registers? Sounds like the perfect system for Sparse Matrix operations!
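The peak figures on this slide all follow one pattern: SIMD lanes, times 2 when FMA is used, times cores, times clock. A minimal sketch of that formula; the function name is illustrative, and the 1.0 GHz clock in the usage lines is a placeholder chosen for easy arithmetic, not the card's specification.

```python
REGISTER_BITS = 512
SP_LANES = REGISTER_BITS // 32   # 16 single-precision values per register
DP_LANES = REGISTER_BITS // 64   # 8 double-precision values per register


def peak_gflops(cores, clock_ghz, simd_lanes, fma):
    """Peak GFlop/s assuming one vector instruction per core per cycle."""
    flops_per_cycle = simd_lanes * (2 if fma else 1)  # FMA = multiply + add
    return flops_per_cycle * cores * clock_ghz


# Placeholder clock of 1.0 GHz (an assumption for illustration only).
sp_fma = peak_gflops(61, 1.0, SP_LANES, fma=True)
dp_plain = peak_gflops(61, 1.0, DP_LANES, fma=False)
print(sp_fma, dp_plain)
```

Sparse kernels rarely approach these numbers, which is why the rest of the talk reasons in terms of bandwidth instead.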

7 Getting actual bandwidth
[Figure: two plots of bandwidth (in GB/s), each with a Ring BW reference. Read: loop-char, loop-int, vect, vect+pref. Write: store, store-nr, store-nrngo.]
Using the appropriate vectorial instructions gives significant improvements.
Read Peak: 183GB/s. Write Peak: 16GB/s.

8 Outline
1. The Intel MIC Architecture
2. SpMV
3. SpMM
4. Conclusion

9 SpMV
Compressed Storage by Row
Test instances from UFL Sparse Matrix Collection
Columns: #, name, #row, #nonzero, density, nnz/row, max nnz/r, max nnz/c

name             #row        #nonzero     density
shallow water1   81,92       24,8         3.5e
cubes sphere     11,         ,            e
scircuit         17,         ,            e
mac econ         26,5        1,273,       e
cop2k A          121,192     1,362,       e
cant             62,451      2,34,        e
pdb1hys          36,417      2,19,        e
webbase-1m       1,,5        3,15,        e
hood             22,542      5,57,        e
bmw              ,362        5,757,       e
pre2             659,33      5,834,       e
pwtk             217,918     5,871,       e
crankseg 2       63,838      7,16,        e
torso1           116,158     8,516,5      6.31e
atmosmodd        1,27,432    8,814,       e
msdoor           415,863     9,794,       e
F1               343,791     13,59,       e
nd24k            72,         14,393,      e
inline 1         53,712      18,659,      e
mesh 248         4,194,34    2,963,       e
ldoor            952,23      21,723,1     2.39e
cage14           1,55,785    27,13,       e
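For reference, Compressed Storage by Row keeps three arrays: row pointers, column indices, and values. A minimal scalar sketch of y = A x over that layout; the array names (row_ptr, col_idx, values) are conventional choices, not names from the talk, and this is the plain unvectorized loop rather than the tuned kernel being measured.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i live in positions row_ptr[i] .. row_ptr[i+1]-1.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[p] * x[col_idx[p]]  # indirect access into x
        y[i] = acc
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
print(y)
```

The indirect read x[col_idx[p]] is the access that the gather instruction on the following slides vectorizes.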

11 Optimization levels
[Figure: performance (in GFlop/s), No Vect. vs. Comp. Vect.]
- Variable performance
- Variable impact of vectorization
vgatherd
- v[:7] = x[adj[:7]]
- Takes one cycle per cache line that spans x[adj[:7]]
[Figure: performance (in GFlop/s) against useful cache line density, No Vect. vs. Comp. Vect.]
Useful Cacheline Density: the fraction of the accessed cache lines of x that is useful for computing y[i].
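Both ideas on this slide can be made concrete by counting the distinct cache lines of x that a row touches. A minimal sketch, assuming 64-byte cache lines, 8-byte values, and x aligned to a cache line boundary; averaging the per-row densities over non-empty rows is one plausible reading of the definition, and the function names are illustrative.

```python
CACHE_LINE_BYTES = 64   # assumed cache line size
ELEM_BYTES = 8          # assumed double-precision entries of x
ELEMS_PER_LINE = CACHE_LINE_BYTES // ELEM_BYTES


def cache_lines_touched(cols):
    """Number of distinct cache lines of x spanned by the given column indices."""
    return len({c // ELEMS_PER_LINE for c in cols})


def useful_cacheline_density(row_ptr, col_idx):
    """Mean over non-empty rows of (entries used) / (entries fetched)."""
    densities = []
    for i in range(len(row_ptr) - 1):
        cols = col_idx[row_ptr[i]:row_ptr[i + 1]]
        if not cols:
            continue  # an empty row touches no cache line of x
        fetched = cache_lines_touched(cols) * ELEMS_PER_LINE
        densities.append(len(cols) / fetched)
    return sum(densities) / len(densities) if densities else 0.0


# Row 0 uses one full cache line; row 1 uses one entry from each of four lines.
row_ptr = [0, 8, 12]
col_idx = list(range(8)) + [0, 8, 16, 24]
print(cache_lines_touched(col_idx[8:12]), useful_cacheline_density(row_ptr, col_idx))
```

Under the one-cycle-per-cache-line cost stated above, the first row's gather would take one cycle and the second row's four, even though the second row has half as many nonzeros.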

13 A Bandwidth point of view
[Figure: bandwidth (in GB/s) under four accounting models: Hardware 512KB cache, Hardware infinite cache, Application, Naive]
Given that the peak bandwidth is between 16GB/s and 18GB/s, that is close to optimal for some matrices.
- Naive: the matrix is transferred once.
- Application: the matrix and the vectors are transferred once.
- Hardware infinite cache: a core that accesses an entry from x brings the whole cacheline in.
- Hardware 512KB cache: a cacheline might be transferred multiple times to a core if the cache is full.
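The first three accounting models can be sketched as byte counts; dividing each by the measured run time would give the corresponding bandwidth. This is a rough sketch under stated assumptions: 8-byte values, 4-byte indices, 64-byte cache lines, the row pointer array counted as part of the matrix, and rows split into contiguous blocks per core. The function name and the partitioning are illustrative, and the 512KB model is left out because it needs a cache simulation.

```python
def traffic_bytes(row_ptr, col_idx, n_cols, n_cores, line_elems=8):
    """Bytes moved under the Naive, Application, and infinite-cache models."""
    n_rows = len(row_ptr) - 1
    nnz = len(col_idx)
    # Naive: values (8 B) and column indices (4 B) per nonzero, plus row pointers.
    naive = 12 * nnz + 4 * (n_rows + 1)
    # Application: the matrix plus x (read once) and y (written once).
    application = naive + 8 * n_cols + 8 * n_rows
    # Infinite cache: each core fetches every distinct cache line of x it touches.
    rows_per_core = -(-n_rows // n_cores)  # ceiling division
    x_lines = 0
    for start in range(0, n_rows, rows_per_core):
        end = min(start + rows_per_core, n_rows)
        cols = col_idx[row_ptr[start]:row_ptr[end]]
        x_lines += len({c // line_elems for c in cols})
    infinite_cache = naive + 8 * n_rows + 8 * line_elems * x_lines
    return {"naive": naive, "application": application,
            "infinite_cache": infinite_cache}


# Two rows, each reading x[0] and x[9], split over two cores.
t = traffic_bytes([0, 2, 4], [0, 9, 0, 9], n_cols=16, n_cores=2)
print(t)
```

In this toy case the infinite-cache volume exceeds the application volume because both cores fetch the same two cache lines of x and use only one entry from each.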

15 So bandwidth constraint?
[Figure: two plots of bandwidth (in GB/s) against number of cores, one for crankseg 2 and one for pre2, each showing Max Sustained Read BW, Max Sustained Write BW, and curves for 4, 3, 2, and 1 thr/core]
crankseg 2
- There is contention inside the cores. More threads do not help.
- There is a hint of contention on the Xeon Phi. Scaling is similar to max bandwidth.
- Bandwidth constraint?
pre2
- No contention inside the cores. More threads help.
- No global contention. Linear scaling when adding cores.
- Latency constraint?

16 Outline
1. The Intel MIC Architecture
2. SpMV
3. SpMM
4. Conclusion

17 SpMM
SpMV gets low GFlop/s because its flop-to-byte ratio is 2 flop / 12 bytes per nonzero.
SpMM
- Put k SpMV together. The ratio becomes 2k/12.
- We experiment with k = 16.
Applications
- PDE
- eigensolving
- graph based recommendations
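The ratio argument can be seen in a scalar SpMM loop: each nonzero (12 bytes of matrix data) is loaded once and reused for all k columns. A minimal sketch, assuming CSR storage and a dense X stored row by row with k entries per row; the names are illustrative, and the ratio counts matrix bytes only, as on the slide.

```python
def spmm_csr(row_ptr, col_idx, values, X):
    """Compute Y = A @ X, with A in CSR form and X a dense list of rows."""
    k = len(X[0])
    n_rows = len(row_ptr) - 1
    Y = [[0.0] * k for _ in range(n_rows)]
    for i in range(n_rows):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            a = values[p]          # one nonzero, loaded once
            x_row = X[col_idx[p]]  # k contiguous entries of X
            for j in range(k):     # 2 flop per column: multiply and add
                Y[i][j] += a * x_row[j]
    return Y


def flop_per_byte(k, bytes_per_nnz=12):
    """2k flop per nonzero over 12 bytes of matrix data per nonzero."""
    return 2 * k / bytes_per_nnz


# Same 3x3 matrix as a CSR example: [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
Y = spmm_csr([0, 2, 3, 5], [0, 2, 1, 0, 2], [1.0, 2.0, 3.0, 4.0, 5.0],
             [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(Y, flop_per_byte(1), flop_per_byte(16))
```

With k = 16 the ratio is 16 times that of SpMV, and the inner loop over j is the part that maps naturally onto wide vector registers.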

18 SpMM Performance
[Figure: performance (in GFlop/s) of three variants: Comp. Vect, Manual Vect., Manual Vect. + NRNGO]
Variants
- Comp. Vect: basic C++.
- Manual Vect.: 8 columns at a time with fma.
- Manual Vect. + NRNGO: using the proper store operation.

19 Bandwidth
[Figure: bandwidth (in GB/s): Hardware 512KB cache, Hardware Infinite cache, Application]
Bandwidth analysis
- Where x goes is much more important.
- Cache is still large enough.

20 Outline
1. The Intel MIC Architecture
2. SpMV
3. SpMM
4. Conclusion

21 Comparison with other architectures
[Figure: performance (in GFlop/s) on SE1P, C25, K2, Dual X568, Dual E]

22 Conclusion
- Xeon Phi gets good performance on Sparse Matrix kernels.
- Vectorization is paramount.
- Register blocking could improve usage, but matrices are too sparse.
- Because of vgatherd, useful cacheline density matters.
- Test on plenty of sparse matrices or risk a bias. Most still use only a few.
- Locality is key (but ordering has little effect).
- Observing performance at different numbers of cores and threads per core hints at what is happening.
- Still low utilization on many matrices.
- Blocked ELLPACK has been developed.

23 Thank you
Support
- Intel for providing Xeon Phi cards.
- NVIDIA for providing C25 and K2 cards.
- OSC for providing computation infrastructure.
More information
- contact: esaule@uncc.edu
- visit: or
