Performance, Scalability, and Numerical Stability of Manycore Algorithms. Wen-mei Hwu, University of Illinois at Urbana-Champaign
1 Performance, Scalability, and Numerical Stability of Manycore Algorithms. Wen-mei Hwu, University of Illinois at Urbana-Champaign
2 Cray XE6 Nodes. Blue Waters contains ,64 Cray XE6 compute nodes. Dual-socket Node: Two AMD Interlagos chips; 6 core modules, 64 threads; GFs peak performance; 64 GBs memory; GB/s memory bandwidth. Gemini Interconnect: Router chip & network interface; Injection Bandwidth (peak) 9.6 GB/s per direction
3 Cray XK7 Nodes. Blue Waters contains ,7 Cray XK7 compute nodes. Dual-socket Node: One AMD Interlagos chip; 8 core modules, threads; 56.5 GFs peak performance; GBs memory; 5 GB/s bandwidth. One NVIDIA Kepler chip: . TFs peak performance; 6 GBs GDDR5 memory; 5 GB/s bandwidth. Gemini Interconnect: Same as XE6 nodes
4 Initial Performance Results. NAMD: million atom benchmark with Langevin dynamics and PME once every 4 steps, from launch to finish, all I/O included; 768 nodes, Kepler+Interlagos is .9x faster over Interlagos-only; 768 nodes, XK7 is .8x XE6. Chroma: Lattice QCD parameters: grid size of 48 x 5 running at the physical values of the quark masses; 768 nodes, Kepler+Interlagos is 4.9X faster over Interlagos-only; 768 nodes, XK7 is .4x XE6. QMCPACK: Full run Graphite 4x4x (56 electrons), QMC followed by VMC; 7 nodes, Kepler+Interlagos is 4.9X faster over Interlagos-only; 7 nodes, XK7 is .7x XE6
5 Scalability vs. Numerical Stability: A Major Algorithm Design Challenge. Parallelism: parallelism to fill growing HW parallelism. Complexity and data scalability: operations should grow linearly with data size. Locality: DRAM bursts and cache space utilization. Regularity: SIMD utilization and load balance. Numerical Stability: pivoting for linear system solvers
6 A Comparison of TDS on Major Platforms. John Stratton, UIUC August -7
7 GPU Tridiagonal System Solver Case Study. Hybrid Methods: PCR-Thomas (Kim, Davidson ), CR-PCR (CUSPARSE ), etc. Numerically unstable: Thomas (sequential), Cyclic Reduction ( step), PCR ( step)
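As a point of reference for the methods named on this slide, the sequential Thomas algorithm can be sketched in C as follows. This is a minimal sketch and not code from the talk: the function name `thomas_solve`, the array names, and the error convention are assumptions made for illustration. The routine is Gaussian elimination without pivoting, so it breaks down on a zero pivot and can lose accuracy when a pivot is very small, which is the instability the slide refers to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the slides): Thomas algorithm, i.e.
 * Gaussian elimination without pivoting for a tridiagonal system.
 *   a[i]: sub-diagonal   (a[0] is unused)
 *   b[i]: main diagonal
 *   c[i]: super-diagonal (c[n-1] is unused)
 *   d[i]: right-hand side on entry, solution on exit
 *   cp:   scratch array of length n
 * Returns 0 on success, -1 if a zero pivot is encountered. */
int thomas_solve(size_t n, const double *a, const double *b,
                 const double *c, double *d, double *cp)
{
    assert(a && b && c && d && cp);
    if (n == 0) return 0;
    if (b[0] == 0.0) return -1;

    /* Forward elimination: remove the sub-diagonal row by row. */
    cp[0] = (n > 1) ? c[0] / b[0] : 0.0;
    d[0] = d[0] / b[0];
    for (size_t i = 1; i < n; ++i) {
        double pivot = b[i] - a[i] * cp[i - 1];
        if (pivot == 0.0) return -1;      /* no row swap is attempted */
        cp[i] = (i + 1 < n) ? c[i] / pivot : 0.0;
        d[i] = (d[i] - a[i] * d[i - 1]) / pivot;
    }

    /* Back substitution from the last unknown upwards. */
    for (size_t i = n - 1; i-- > 0; )
        d[i] -= cp[i] * d[i + 1];
    return 0;
}
```

Each row depends on the previous one, so the work is O(n) but strictly sequential within one system.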
8 Pivoting: Judiciously swap rows to avoid bad cases
9 Problem Decomposition: SPIKE (Polizzi et al.). A X = F; A = DS; D (SX) = F; D Y = F (step 1); SX = Y (step 2)
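For two partitions, the factorization on this slide can be written out explicitly. The block names $A_1$, $A_2$, $B$, $C$, $V$, $W$ are not on the slide; they are assumptions following the usual presentation of SPIKE.

\[
A =
\begin{bmatrix} A_1 & B \\ C & A_2 \end{bmatrix}
=
\underbrace{\begin{bmatrix} A_1 & 0 \\ 0 & A_2 \end{bmatrix}}_{D}
\underbrace{\begin{bmatrix} I & V \\ W & I \end{bmatrix}}_{S},
\qquad
V = A_1^{-1} B, \quad W = A_2^{-1} C .
\]

Multiplying out $DS$ gives back the off-diagonal blocks $A_1 V = B$ and $A_2 W = C$. Solving $D\,Y = F$ splits into the independent systems $A_1 Y_1 = F_1$ and $A_2 Y_2 = F_2$, and $S\,X = Y$ then couples the partitions. For a tridiagonal $A$, $B$ and $C$ each hold a single nonzero entry, so $V$ and $W$ each have only one nonzero column (the "spikes").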
10 Forming S: all i tiles, all solved in parallel
11 Put the stable sequential algorithm inside each GPU thread. Each thread will process one tile by itself with a sequential, numerically stable pivoting algorithm. Note that each thread accessing the first element of its own tile will result in large, strided accesses
12 Memory Layout Issue (figure: memory locations accessed by each thread)
13 GPU Memory Bandwidth vs. Stride. SAXPY with stride: y[i * stride] = a * x[i * stride] + y[i * stride]; Source: "Efficient Sparse Matrix-Vector Multiplication on CUDA", Nathan Bell and Michael Garland, NVIDIA Technical Report NVR-2008-004
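The strided SAXPY expression on this slide can be written as a plain C loop. This is a sketch for illustration only: the function name `saxpy_strided` and its parameter list are assumptions, and on a GPU each iteration `i` would be executed by a separate thread instead of a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: strided SAXPY as written on the slide,
 *   y[i*stride] = a * x[i*stride] + y[i*stride]   for i = 0 .. n-1.
 * Both arrays must hold at least (n-1)*stride + 1 elements. */
void saxpy_strided(size_t n, size_t stride, float a,
                   const float *x, float *y)
{
    assert(stride > 0);
    for (size_t i = 0; i < n; ++i)
        y[i * stride] = a * x[i * stride] + y[i * stride];
}
```

With `stride == 1`, neighbouring iterations touch neighbouring addresses; with a larger stride, consecutive iterations land far apart in memory, which is the access pattern whose bandwidth the slide plots.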
14 Tiles Processed by Each Thread. Each tile: (figure). Layout of all tiles: (similar to ELL for transposition)
15 Another Data Layout Alternative: ASTA, divide into tiles
16 ASTA Data Layout
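One way to picture the ASTA layout is as an index mapping. The sketch below assumes the usual reading of ASTA as "Array of Structures of Tiled Arrays": `N` structures of `S` fields each are grouped into tiles of `T` structures, and within a tile the same field of all `T` structures is stored contiguously. The function names, the parameter names, and the requirement that `T` divides `N` are assumptions for illustration and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: offset of field s of structure n in ASTA layout.
 * In plain AoS layout the same element lives at offset n*S + s. */
size_t asta_offset(size_t n, size_t s, size_t S, size_t T)
{
    assert(T > 0 && s < S);
    return (n / T) * (S * T)   /* start of the tile holding structure n */
         + s * T               /* start of field s inside that tile     */
         + (n % T);            /* position of structure n in the tile   */
}

/* Hypothetical helper: out-of-place AoS -> ASTA conversion.
 * src and dst each hold N*S floats; N must be a multiple of T. */
void aos_to_asta(const float *src, float *dst,
                 size_t N, size_t S, size_t T)
{
    assert(T > 0 && N % T == 0);
    for (size_t n = 0; n < N; ++n)
        for (size_t s = 0; s < S; ++s)
            dst[asta_offset(n, s, S, T)] = src[n * S + s];
}
```

If structure `n` is the tile handled by thread `n`, then `T` neighbouring threads reading the same field `s` touch `T` consecutive addresses, instead of addresses `S` elements apart as in AoS.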
17 In-place Transposition: Step 1
// data[w][h] --> data[h][w]
parallel for (j < w)
  parallel for (i < h)
    float tmp = data[j][i];   // offset = j*h + i
18 In-place Transposition: Barrier
// data[w][h] --> data[h][w]
parallel for (j < w)
  parallel for (i < h)
    float tmp = data[j][i];   // offset = j*h + i
    barrier();
19 In-place Transposition: Step 2
// data[w][h] --> data[h][w]
parallel for (j < w)
  parallel for (i < h)
    float tmp = data[j][i];   // offset = j*h + i
    barrier();
    data[i][j] = tmp;         // offset = i*w + j
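The read, barrier, write sequence of these three slides can be emulated on a CPU with two separate loop nests. This is a sequential sketch under stated assumptions, not the GPU kernel from the talk: the function name `transpose_in_place` is hypothetical, the temporary buffer stands in for the per-thread `tmp` registers, and the gap between the two loop nests plays the role of `barrier()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: transpose data[w][h] into data[h][w] in place.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int transpose_in_place(float *data, size_t w, size_t h)
{
    assert(data != NULL);
    if (w == 0 || h == 0) return 0;

    float *tmp = malloc(w * h * sizeof *tmp);
    if (tmp == NULL) return -1;

    /* Step 1: every (j, i) "thread" reads its own element. */
    for (size_t j = 0; j < w; ++j)
        for (size_t i = 0; i < h; ++i)
            tmp[j * h + i] = data[j * h + i];     /* offset = j*h + i */

    /* Barrier: all reads above finish before any write below starts. */

    /* Step 2: every "thread" writes its element to the transposed slot. */
    for (size_t j = 0; j < w; ++j)
        for (size_t i = 0; i < h; ++i)
            data[i * w + j] = tmp[j * h + i];     /* offset = i*w + j */

    free(tmp);
    return 0;
}
```

Without the separation between the two phases, a write to `data[i][j]` could overwrite an element that another iteration has not read yet.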
20 AoS to ASTA Transformation
AoS to ASTA Marshaling Kernel | Global Memory Throughput (GB/s) | Fine Print
Out-of-Place | 8 | x Space
In-Place Barrier Sync | 95 | Tile Size (tunable) < On-chip Memory
21 Dynamic Tiling
22 Cost and Benefit of ASTA Layout Marshaling
23 Error and Stability
24 Speed
25 Summary. Designing high-performance, scalable, and numerically stable algorithms is challenging. Fast transposition and dynamic tiling provide strong building blocks. We have built the first high-performance, scalable, and numerically stable tridiagonal solver for many-cores: it matches the speed of CUSPARSE, surpasses the data scalability of CUSPARSE, and matches the numerical stability of Intel MKL
26 THANK YOU! ANY QUESTIONS?
27 New Kernel Development Tools. OpenACC accelerator pragmas: wider use of GPU in large applications but less performance in each kernel; Cray and others. Portland Group CUDA FORTRAN compiler. NVIDIA Thrust. Microsoft C++AMP
28 VAdd in OpenACC
void computeAcc(float *C, const float *A, const float *B, int n)
{
    // Copy A and B to the device, copy C back, and run the loop in parallel.
    #pragma acc parallel loop copyin(A[:n]) copyin(B[:n]) copyout(C[:n])
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}
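29 VAdd in C++AMP
#include <amp.h>
using namespace concurrency;

void vecAdd(float* A, float* B, float* C, int n)
{
    // Wrap the host arrays in 1-D views usable on the accelerator.
    array_view<const float,1> AV(n,A), BV(n,B);
    array_view<float,1> CV(n,C);
    CV.discard_data();   // C is write-only, so skip copying it to the device
    parallel_for_each(CV.extent, [=](index<1> i) restrict(amp)
    {
        CV[i] = AV[i] + BV[i];
    });
    CV.synchronize();    // copy the result back to C
}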
30 Thank You!
31 Numerical Stability. Algorithms that can always find an appropriate operation order, and thus find a solution to the problem as long as it exists for any given input values, are numerically stable. Algorithms that fall short are numerically unstable.