Performance, Scalability, and Numerical Stability of Manycore Algorithms. Wen-mei Hwu, University of Illinois at Urbana-Champaign

1 Performance, Scalability, and Numerical Stability of Manycore Algorithms
Wen-mei Hwu, University of Illinois at Urbana-Champaign

2 Cray XE6 Nodes
Blue Waters contains ,64 Cray XE6 compute nodes.
Dual-socket node
  Two AMD Interlagos chips
  6 core modules, 64 threads
  GFs peak performance
  64 GBs memory
  GB/s memory bandwidth
Gemini interconnect
  Router chip & network interface
  Injection bandwidth (peak): 9.6 GB/s per direction

3 Cray XK7 Nodes
Blue Waters contains ,7 Cray XK7 compute nodes.
Dual-socket node
  One AMD Interlagos chip
    8 core modules, threads
    56.5 GFs peak performance
    GBs memory
    5 GB/s bandwidth
  One NVIDIA Kepler chip
    . TFs peak performance
    6 GBs GDDR5 memory
    5 GB/s bandwidth
Gemini interconnect
  Same as XE6 nodes

4 Initial Performance Results
NAMD
  million atom benchmark with Langevin dynamics and PME once every 4 steps, from launch to finish, all I/O included
  768 nodes, Kepler+Interlagos is .9x faster over Interlagos-only
  768 nodes, XK7 is .8x XE6
Chroma
  Lattice QCD parameters: grid size of 48 x 5, running at the physical values of the quark masses
  768 nodes, Kepler+Interlagos is 4.9x faster over Interlagos-only
  768 nodes, XK7 is .4x XE6
QMCPACK
  Full run Graphite 4x4x (56 electrons), QMC followed by VMC
  7 nodes, Kepler+Interlagos is 4.9x faster over Interlagos-only
  7 nodes, XK7 is .7x XE6

5 Scalability vs. Numerical Stability: A Major Algorithm Design Challenge
Parallelism
  Parallelism to fill growing HW parallelism
Complexity and data scalability
  Operations should grow linearly with data size
Locality
  DRAM bursts and cache space utilization
Regularity
  SIMD utilization and load balance
Numerical stability
  Pivoting for linear system solvers

6 A Comparison of TDS on Major Platforms
John Stratton, UIUC, August -7

7 GPU Tridiagonal System Solver Case Study
Hybrid methods
  PCR-Thomas (Kim, Davidson )
  CR-PCR (CUSPARSE )
  Etc.
Numerically unstable
  Thomas (sequential)
  Cyclic Reduction ( step)
  PCR ( step)
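For reference, the sequential Thomas algorithm named on this slide can be sketched as follows. This is a minimal illustration, not code from the talk; the function name `thomas_solve` and the argument layout (`a` sub-diagonal, `b` main diagonal, `c` super-diagonal, `d` right-hand side) are assumptions chosen for the example. Because it never swaps rows, it divides by whatever pivot it meets, which is where the instability comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: solves a tridiagonal system with the Thomas algorithm.
// a[i] is the sub-diagonal entry of row i (a[0] unused),
// b[i] the main diagonal, c[i] the super-diagonal (c[n-1] unused),
// d the right-hand side. No pivoting is performed.
std::vector<double> thomas_solve(const std::vector<double>& a,
                                 std::vector<double> b,
                                 const std::vector<double>& c,
                                 std::vector<double> d) {
    const std::size_t n = b.size();
    assert(a.size() == n && c.size() == n && d.size() == n);
    if (n == 0) return {};

    // Forward elimination: remove the sub-diagonal row by row.
    for (std::size_t i = 1; i < n; ++i) {
        const double m = a[i] / b[i - 1];  // fails or loses accuracy if b[i-1] is 0 or tiny
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }

    // Back substitution, from the last row upward.
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t i = n - 1; i-- > 0;) {
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    }
    return x;
}
```

The loop-carried dependence in both phases is what makes the method sequential: row i cannot be eliminated before row i-1.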

8 Pivoting
Judiciously swap rows to avoid bad cases
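The effect of row swapping can be illustrated with a small dense solver. This sketch uses textbook Gaussian elimination with partial pivoting, which is an assumption made for illustration and not the tridiagonal-specific pivoting scheme used in the talk; the name `solve_partial_pivoting` is hypothetical. It does not detect singular matrices.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: dense Gaussian elimination with partial pivoting.
// A is an n x n matrix stored row by row, b the right-hand side.
std::vector<double> solve_partial_pivoting(std::vector<std::vector<double>> A,
                                           std::vector<double> b) {
    const std::size_t n = b.size();
    assert(A.size() == n);

    for (std::size_t k = 0; k < n; ++k) {
        // Pick the row at or below k with the largest entry in column k.
        std::size_t p = k;
        for (std::size_t r = k + 1; r < n; ++r) {
            if (std::fabs(A[r][k]) > std::fabs(A[p][k])) p = r;
        }
        // Swap it into the pivot position so we never divide by a zero
        // (or tiny) pivot when a better one is available.
        std::swap(A[k], A[p]);
        std::swap(b[k], b[p]);

        // Eliminate column k from the rows below.
        for (std::size_t r = k + 1; r < n; ++r) {
            const double m = A[r][k] / A[k][k];
            for (std::size_t c = k; c < n; ++c) A[r][c] -= m * A[k][c];
            b[r] -= m * b[k];
        }
    }

    // Back substitution on the resulting upper-triangular system.
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {
        double s = b[i];
        for (std::size_t c = i + 1; c < n; ++c) s -= A[i][c] * x[c];
        x[i] = s / A[i][i];
    }
    return x;
}
```

For the system with rows (0, 1) and (1, 1), elimination without a swap would divide by the zero in the top-left corner; with the swap the solve proceeds normally.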

9 Problem Decomposition
SPIKE (Polizzi et al.)
  A X = F
  A = D S
  D (S X) = F
  D Y = F   (step 1)
  S X = Y   (step 2)

10 Forming S
All tiles are solved in parallel
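As a worked illustration of the SPIKE factorization, consider the smallest case of two partitions. The block names A_1, A_2 (diagonal tiles), B_1, C_2 (coupling blocks), and V_1, W_2 (the "spikes") are notation assumed here for the example; the slides only give A, D, S, X, Y, and F. The sketch requires the amsmath package.

```latex
\[
A =
\begin{bmatrix} A_1 & B_1 \\ C_2 & A_2 \end{bmatrix}
=
\underbrace{\begin{bmatrix} A_1 & 0 \\ 0 & A_2 \end{bmatrix}}_{D}
\underbrace{\begin{bmatrix} I & V_1 \\ W_2 & I \end{bmatrix}}_{S},
\qquad A_1 V_1 = B_1, \quad A_2 W_2 = C_2 .
\]
\[
\text{Step 1: } A_i Y_i = F_i \quad (i = 1, 2),
\qquad
\text{Step 2: }
\begin{bmatrix} I & V_1 \\ W_2 & I \end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}
=
\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}.
\]
```

Multiplying D by S reproduces A block by block, and every solve involving a diagonal tile A_i (for V_1, W_2, and Y_i) is independent of the others, which is why the tiles can be processed in parallel.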

11 Put the stable sequential algorithm inside each GPU thread
Each thread will process one tile by itself with a sequential, numerically stable pivoting algorithm
Note that each thread accessing the first element of its own tile will result in large, strided accesses

12 Memory Layout Issue
[Figure: per-thread accesses into the tile layout]

13 GPU Memory Bandwidth vs. Stride
SAXPY with stride:
    y[i * stride] = a * x[i * stride] + y[i * stride];
Nathan Bell and Michael Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA", NVIDIA Technical Report NVR-2008-004, 2008
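The strided SAXPY statement on this slide can be written out as a plain CPU reference loop. This is a sketch for illustration only; the function name `saxpy_strided` and its parameters are assumptions. On a GPU, iteration i would map to thread i, so a stride greater than 1 means neighbouring threads touch non-adjacent addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for the strided SAXPY statement:
// updates `count` elements of y, spaced `stride` apart.
void saxpy_strided(std::size_t count, std::size_t stride, float a,
                   const std::vector<float>& x, std::vector<float>& y) {
    // The last touched index must lie inside both arrays.
    assert(count == 0 || (count - 1) * stride < x.size());
    assert(count == 0 || (count - 1) * stride < y.size());

    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t k = i * stride;  // address touched by "thread" i
        y[k] = a * x[k] + y[k];
    }
}
```

With stride 1 the accesses of consecutive iterations are contiguous; larger strides leave gaps between them, which is the access pattern whose bandwidth the slide examines.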

14 Tiles Processed by Each Thread
Each tile:
Layout of all tiles: (similar to ELL for transposition)

15 Another Data Layout Alternative
ASTA: divide into tiles

16 ASTA Data Layout
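One way to picture the ASTA layout is as an index mapping from a plain array of structures. The sketch below assumes that ASTA groups `tile` consecutive structures together and, inside each group, stores every field as a short contiguous array; the function names and this exact formula are assumptions for illustration and are not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of field f of structure s in a plain array-of-structures layout.
std::size_t aos_offset(std::size_t s, std::size_t f, std::size_t num_fields) {
    return s * num_fields + f;
}

// Offset of the same element in the assumed ASTA layout: structures are
// grouped into tiles of `tile` structures, and within a tile each field
// is stored as a contiguous run of `tile` values.
std::size_t asta_offset(std::size_t s, std::size_t f,
                        std::size_t num_fields, std::size_t tile) {
    const std::size_t tile_id = s / tile;  // which tile the structure falls in
    const std::size_t lane = s % tile;     // position inside that tile
    return tile_id * (num_fields * tile) + f * tile + lane;
}

// Out-of-place conversion from AoS to the assumed ASTA layout.
// The structure count must be a multiple of the tile size.
std::vector<float> aos_to_asta(const std::vector<float>& aos,
                               std::size_t num_fields, std::size_t tile) {
    assert(num_fields > 0 && tile > 0);
    assert(aos.size() % (num_fields * tile) == 0);
    const std::size_t num_structs = aos.size() / num_fields;

    std::vector<float> asta(aos.size());
    for (std::size_t s = 0; s < num_structs; ++s) {
        for (std::size_t f = 0; f < num_fields; ++f) {
            asta[asta_offset(s, f, num_fields, tile)] =
                aos[aos_offset(s, f, num_fields)];
        }
    }
    return asta;
}
```

Under this mapping, the same field of neighbouring structures in a tile sits at adjacent addresses, so threads that each own one structure read consecutive locations.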

17 In-place Transposition: Step
    // data[w][h] --> data[h][w]
    parallel for (j < w)
      parallel for (i < h)
        float tmp = data[j][i];   // offset = j*h + i

18 In-place Transposition: Barrier
    // data[w][h] --> data[h][w]
    parallel for (j < w)
      parallel for (i < h)
        float tmp = data[j][i];   // offset = j*h + i
        barrier();

19 In-place Transposition: Step
    // data[w][h] --> data[h][w]
    parallel for (j < w)
      parallel for (i < h)
        float tmp = data[j][i];   // offset = j*h + i
        barrier();
        data[i][j] = tmp;         // offset = i*w + j
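The read, barrier, write pattern above can be emulated sequentially to check the index arithmetic. In this sketch the `tmp` vector stands in for the per-thread registers of the parallel version, and the gap between the two loops plays the role of `barrier()`; the function name `transpose_tile` is an assumption for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential emulation of the barrier-based tile transposition:
// data is interpreted as data[w][h] on entry and data[h][w] on exit.
void transpose_tile(std::vector<float>& data, std::size_t w, std::size_t h) {
    assert(data.size() == w * h);

    // Phase 1: every (j, i) "thread" reads its own element into a private
    // temporary. tmp emulates those per-thread registers.
    std::vector<float> tmp(w * h);
    for (std::size_t j = 0; j < w; ++j) {
        for (std::size_t i = 0; i < h; ++i) {
            tmp[j * h + i] = data[j * h + i];  // offset = j*h + i
        }
    }

    // The barrier sits here: no write may start before all reads finish,
    // otherwise a thread could overwrite an element another still needs.

    // Phase 2: every "thread" writes its value to the transposed position.
    for (std::size_t j = 0; j < w; ++j) {
        for (std::size_t i = 0; i < h; ++i) {
            data[i * w + j] = tmp[j * h + i];  // offset = i*w + j
        }
    }
}
```

Because every element is held in a temporary across the barrier, the tile has to fit in on-chip storage, which matches the tile-size constraint noted for the in-place kernel.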

20 AoS to ASTA Transformation
AoS to ASTA marshaling

Kernel                  Global Memory Throughput (GB/s)   Fine Print
Out-of-Place            8                                 x Space
In-Place Barrier Sync   95                                Tile Size (tunable) < On-chip Memory

21 Dynamic Tiling

22 Cost and Benefit of ASTA Layout Marshaling

23 Error and Stability

24 Speed

25 Summary
Designing high-performance, scalable, and numerically stable algorithms is challenging
Fast transposition and dynamic tiling provide strong building blocks
We have built the first high-performance, scalable, and numerically stable tri-diagonal solver for many-cores
  Matches the speed of CUSPARSE
  Surpasses the data scalability of CUSPARSE
  Matches the numerical stability of Intel MKL

26 THANK YOU! ANY QUESTIONS?

27 New Kernel Development Tools
OpenACC accelerator pragmas
  Wider use of GPU in large applications, but less performance in each kernel
  Cray and others
Portland Group CUDA Fortran compiler
NVIDIA Thrust
Microsoft C++ AMP

28 VAdd in OpenACC
    void computeAcc(float *C, const float *A, const float *B, int n)
    {
        // Copy A and B to the accelerator, run the loop there, copy C back.
        #pragma acc parallel loop copyin(A[0:n]) copyin(B[0:n]) copyout(C[0:n])
        for (int i = 0; i < n; i++) {
            C[i] = A[i] + B[i];
        }
    }

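29 VAdd in C++ AMP
    #include <amp.h>
    using namespace concurrency;

    void vadd(float* A, float* B, float* C, int n)
    {
        // Wrap the host arrays in one-dimensional views.
        array_view<const float, 1> AV(n, A), BV(n, B);
        array_view<float, 1> CV(n, C);
        CV.discard_data();   // C is write-only, so skip copying it to the accelerator
        parallel_for_each(CV.extent, [=](index<1> i) restrict(amp)
        {
            CV[i] = AV[i] + BV[i];
        });
        CV.synchronize();    // copy the result back to C
    }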

30 Thank You!

31 Numerical Stability
Algorithms that can always find an appropriate operation order, and thus find a solution to the problem as long as one exists for any given input values, are numerically stable. Algorithms that fall short are numerically unstable.
