Performance, Scalability, and Numerical Stability of Manycore Algorithms. Wen-mei Hwu, University of Illinois at Urbana-Champaign


1 Performance, Scalability, and Numerical Stability of Manycore Algorithms. Wen-mei Hwu, University of Illinois at Urbana-Champaign

2 Cray XE6 Nodes. Blue Waters contains ,64 Cray XE6 compute nodes. Dual-socket Node: two AMD Interlagos chips; 6 core modules, 64 threads; GFs peak performance; 64 GBs memory; GB/s memory bandwidth. Gemini Interconnect: router chip & network interface; injection bandwidth (peak) 9.6 GB/s per direction.

3 Cray XK7 Nodes. Blue Waters contains ,7 Cray XK7 compute nodes. Dual-socket Node: one AMD Interlagos chip (8 core modules, threads; 56.5 GFs peak performance; GBs memory; 5 GB/s bandwidth) and one NVIDIA Kepler chip (. TFs peak performance; 6 GBs GDDR5 memory; 5 GB/s bandwidth). Gemini Interconnect: same as XE6 nodes.

4 Initial Performance Results. NAMD: million atom benchmark with Langevin dynamics and PME once every 4 steps, from launch to finish, all I/O included. 768 nodes, Kepler+Interlagos is .9x faster over Interlagos-only; 768 nodes, XK7 is .8x XE6. Chroma: Lattice QCD parameters: grid size of 48 x 5, running at the physical values of the quark masses. 768 nodes, Kepler+Interlagos is 4.9X faster over Interlagos-only; 768 nodes, XK7 is .4x XE6. QMCPACK: full run Graphite 4x4x (56 electrons), QMC followed by VMC. 7 nodes, Kepler+Interlagos is 4.9X faster over Interlagos-only; 7 nodes, XK7 is .7x XE6.

5 Scalability vs. Numerical Stability: A Major Algorithm Design Challenge. Parallelism: parallelism to fill growing HW parallelism. Complexity and data scalability: operations should grow linearly with data size. Locality: DRAM bursts and cache space utilization. Regularity: SIMD utilization and load balance. Numerical Stability: pivoting for linear system solvers.

6 A Comparison of TDS on Major Platforms. John Stratton, UIUC, August -7

7 GPU Tridiagonal System Solver Case Study. Hybrid Methods: PCR-Thomas (Kim, Davidson ), CR-PCR (CUSPARSE ), etc. Numerically unstable. Thomas (sequential); Cyclic Reduction ( step); PCR ( step).
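The sequential Thomas algorithm named on this slide can be sketched as follows. This is a minimal C++ illustration, not the code behind the slides; the function name `thomas_solve` and the array convention (`a` sub-diagonal, `b` main diagonal, `c` super-diagonal, `d` right-hand side) are assumptions made for the example. It never swaps rows, so a zero or very small pivot makes it fail even when the system has a solution.

```cpp
#include <cstddef>
#include <vector>

// Thomas algorithm (no pivoting) for a tridiagonal system A x = d.
// a[i] = A[i][i-1] (a[0] unused), b[i] = A[i][i], c[i] = A[i][i+1]
// (c[n-1] unused). All names here are hypothetical.
std::vector<double> thomas_solve(const std::vector<double>& a,
                                 std::vector<double> b,
                                 const std::vector<double>& c,
                                 std::vector<double> d) {
    const std::size_t n = b.size();
    if (n == 0) return {};
    // Forward elimination: remove the sub-diagonal entry of each row.
    // The pivot b[i - 1] is used unchecked, which is the unstable part.
    for (std::size_t i = 1; i < n; ++i) {
        const double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    // Back substitution from the last row upward.
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t i = n - 1; i-- > 0;) {
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    }
    return x;
}
```

The work is O(n), but each step depends on the previous one, which is why the slide labels the algorithm sequential.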

8 Pivoting. Judiciously swap rows to avoid bad cases.
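Row swapping for a tridiagonal system can be sketched as Gaussian elimination with partial pivoting restricted to adjacent rows. The C++ below is an illustrative sketch under assumed names (`tridiag_solve_pivot`, the `a`/`b`/`c` diagonal convention, and the fill-in array `c2`); it is not the pivoting scheme used in the talk, which the slides do not spell out.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Tridiagonal solve with partial pivoting between adjacent rows.
// a[i] = A[i][i-1] (a[0] unused), b[i] = A[i][i], c[i] = A[i][i+1]
// (c[n-1] unused). Returns false if a zero pivot remains after swapping.
// All names are hypothetical.
bool tridiag_solve_pivot(std::vector<double> a, std::vector<double> b,
                         std::vector<double> c, std::vector<double> rhs,
                         std::vector<double>& x) {
    const std::size_t n = b.size();
    if (n == 0) return false;
    c[n - 1] = 0.0;
    // A row swap can create one extra entry at column i + 2 (fill-in).
    std::vector<double> c2(n, 0.0);
    for (std::size_t i = 0; i + 1 < n; ++i) {
        // Swap rows i and i + 1 when the lower row has the larger pivot.
        if (std::fabs(a[i + 1]) > std::fabs(b[i])) {
            std::swap(b[i], a[i + 1]);
            std::swap(c[i], b[i + 1]);
            std::swap(c2[i], c[i + 1]);
            std::swap(rhs[i], rhs[i + 1]);
        }
        if (b[i] == 0.0) return false;  // both candidates were zero
        const double m = a[i + 1] / b[i];
        b[i + 1] -= m * c[i];
        c[i + 1] -= m * c2[i];
        rhs[i + 1] -= m * rhs[i];
    }
    if (b[n - 1] == 0.0) return false;
    // Back substitution over an upper-triangular matrix of bandwidth two.
    x.assign(n, 0.0);
    for (std::size_t i = n; i-- > 0;) {
        double s = rhs[i];
        if (i + 1 < n) s -= c[i] * x[i + 1];
        if (i + 2 < n) s -= c2[i] * x[i + 2];
        x[i] = s / b[i];
    }
    return true;
}
```

A system whose first diagonal entry is zero, such as rows (0 1 0), (1 0 1), (0 1 1), is solvable here because the first two rows are swapped, whereas an elimination without pivoting would divide by zero on the first step.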

9 Problem Decomposition: SPIKE (Polizzi et al.). A X = F; A = DS; D (SX) = F; D Y = F (step ); SX = Y (step ).
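The factorization A = DS can be written out for a small case. The block below is a sketch for three partitions; the block names A_i, B_i, C_i, V_i, W_i follow common SPIKE notation and are assumptions here, since the slide gives only the top-level equations.

```latex
% Three-partition sketch of the SPIKE factorization A = D S.
% A_i: diagonal blocks; B_i, C_i: coupling blocks (for a tridiagonal A
% each contains a single nonzero entry); V_i, W_i: the "spikes".
\[
A =
\begin{pmatrix}
A_1 & B_1 &     \\
C_2 & A_2 & B_2 \\
    & C_3 & A_3
\end{pmatrix},
\qquad
D =
\begin{pmatrix}
A_1 &     &     \\
    & A_2 &     \\
    &     & A_3
\end{pmatrix},
\qquad
S = D^{-1} A =
\begin{pmatrix}
I   & V_1 &     \\
W_2 & I   & V_2 \\
    & W_3 & I
\end{pmatrix},
\]
\[
V_i = A_i^{-1} B_i, \qquad W_i = A_i^{-1} C_i .
\]
% Solving A X = F then splits into two stages:
\[
D\,Y = F \quad\text{(independent solves, one per block } A_i\text{)},
\qquad
S\,X = Y \quad\text{(coupling between neighbouring blocks)} .
\]
```

Because D is block diagonal, the first stage consists of independent solves with each A_i, which is where the parallelism comes from.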

10 Forming S. All i tiles are solved in parallel.

11 Put the stable sequential algorithm inside each GPU thread. Each thread will process one tile by itself with a sequential, numerically stable pivoting algorithm. Note that each thread accessing the first element of its own tile will result in large, strided accesses.

12 Memory Layout Issue (figure labels: eight threads)

13 GPU Memory Bandwidth vs. Stride. SAXPY with stride: y[i * stride] = a * x[i * stride] + y[i * stride]; Source: "Efficient Sparse Matrix-Vector Multiplication on CUDA", Nathan Bell and Michael Garland, NVIDIA Technical Report NVR-8-4.
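The strided SAXPY statement can be placed in a small host-side loop to make the access pattern explicit. This is a plain C++ sketch with an assumed function name (`saxpy_strided`); on a GPU each value of `i` would be handled by a different thread, so neighbouring threads would touch addresses `stride` elements apart.

```cpp
#include <cstddef>
#include <vector>

// Strided SAXPY: updates every stride-th element of y.
// x and y must hold at least (n - 1) * stride + 1 elements.
// The function name and signature are hypothetical.
void saxpy_strided(std::size_t n, std::size_t stride, float a,
                   const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i * stride] = a * x[i * stride] + y[i * stride];
    }
}
```

With `stride == 1` consecutive iterations touch consecutive addresses; as the stride grows, fewer of the words fetched in each DRAM burst are actually used, which is the effect the slide's bandwidth plot measures.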

14 Tiles Processed by Each Thread. Each tile; layout of all tiles (similar to ELL for transposition).
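One way to read "similar to ELL" is that element k of every tile is stored together, so that neighbouring threads read neighbouring addresses. The index functions below are a sketch of that reading only; both function names are assumptions and the slide's figure may differ in detail.

```cpp
#include <cstddef>

// Tile-contiguous layout: thread t owns one contiguous tile of tile_len
// elements, so element k of its tile sits at t * tile_len + k.
std::size_t contiguous_offset(std::size_t t, std::size_t k,
                              std::size_t tile_len) {
    return t * tile_len + k;
}

// Interleaved (ELL-like) layout: element k of all tiles is stored
// together, so element k of thread t's tile sits at k * num_threads + t.
std::size_t interleaved_offset(std::size_t t, std::size_t k,
                               std::size_t num_threads) {
    return k * num_threads + t;
}
```

In the contiguous layout, threads t and t + 1 reading the same k are `tile_len` elements apart; in the interleaved layout they are one element apart.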

15 Another Data Layout Alternative: ASTA. Divide into tiles.

16 ASTA Data Layout
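Assuming ASTA stands for an array-of-structures-of-tiled-arrays layout, in which groups of T consecutive structures are stored field by field, the address mapping can be sketched as below. The function names, the parameter names (`num_fields`, `tile`) and the exact ordering inside a tile are assumptions for illustration, not taken from the slides.

```cpp
#include <cstddef>

// Array of Structures: field f of structure s, with num_fields fields each.
std::size_t aos_offset(std::size_t s, std::size_t f, std::size_t num_fields) {
    return s * num_fields + f;
}

// ASTA (as assumed here): structures are grouped into tiles of `tile`
// structures; within a tile, field f of all structures is contiguous.
std::size_t asta_offset(std::size_t s, std::size_t f,
                        std::size_t num_fields, std::size_t tile) {
    const std::size_t tile_index = s / tile;  // which tile
    const std::size_t in_tile = s % tile;     // position inside the tile
    return tile_index * num_fields * tile + f * tile + in_tile;
}
```

Under this mapping, consecutive structures in the same tile are adjacent in memory for a fixed field, while all data of one tile stays within `num_fields * tile` elements.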

17 In-place Transposition: Step
    // data[w][h] --> data[h][w]
    parallel for (j < w)
      parallel for (i < h)
        float tmp = data[j][i];  // offset = j*h + i

18 In-place Transposition: Barrier
    // data[w][h] --> data[h][w]
    parallel for (j < w)
      parallel for (i < h)
        float tmp = data[j][i];  // offset = j*h + i
        barrier();

19 In-place Transposition: Step
    // data[w][h] --> data[h][w]
    parallel for (j < w)
      parallel for (i < h)
        float tmp = data[j][i];  // offset = j*h + i
        barrier();
        data[i][j] = tmp;        // offset = i*w + j
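The read, barrier, write sequence on these slides can be emulated sequentially to check the index arithmetic. In this C++ sketch the vector `tmp` stands in for the per-thread `tmp` registers and the gap between the two loops plays the role of `barrier()`; the function name `transpose_in_place` is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Sequential emulation of the barrier-based in-place transposition:
// data is viewed as data[w][h] on entry and as data[h][w] on return.
void transpose_in_place(std::vector<float>& data, std::size_t w,
                        std::size_t h) {
    // Phase 1: every (j, i) "thread" reads its element into a private tmp.
    std::vector<float> tmp(w * h);
    for (std::size_t j = 0; j < w; ++j) {
        for (std::size_t i = 0; i < h; ++i) {
            tmp[j * h + i] = data[j * h + i];  // offset = j*h + i
        }
    }
    // barrier(): all reads complete before any write begins.
    // Phase 2: every thread writes its value to the transposed position.
    for (std::size_t j = 0; j < w; ++j) {
        for (std::size_t i = 0; i < h; ++i) {
            data[i * w + j] = tmp[j * h + i];  // offset = i*w + j
        }
    }
}
```

The barrier is what makes the overwrite safe: without it, a thread could write to a location that another thread has not yet read.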

20 AoS to ASTA Transformation. AoS to ASTA marshaling kernel: global memory throughput (GB/s) and fine print. Out-of-Place: 8; x space. In-Place, Barrier Sync: 95; tile size (tunable) < on-chip memory.

21 Dynamic Tiling

22 Cost and Benefit of ASTA Layout Marshaling

23 Error and Stability

24 Speed

25 Summary. Designing high-performance, scalable, and numerically stable algorithms is challenging. Fast transposition and dynamic tiling provide strong building blocks. We have built the first high-performance, scalable, and numerically stable tridiagonal solver for many-cores: it matches the speed of CUSPARSE, surpasses the data scalability of CUSPARSE, and matches the numerical stability of Intel MKL.

26 THANK YOU! ANY QUESTIONS?

27 New Kernel Development Tools. OpenACC accelerator pragmas: wider use of GPU in large applications but less performance in each kernel; Cray and others. Portland Group CUDA FORTRAN compiler. NVIDIA Thrust. Microsoft C++AMP.

28 VAdd in OpenACC
    void computeAcc(float *C, const float *A, const float *B, int n) {
      // Copy A and B to the device, run the loop in parallel, copy C back.
      #pragma acc parallel loop copyin(A[0:n]) copyin(B[0:n]) copyout(C[0:n])
      for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
      }
    }

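29 VAdd in C++AMP
    #include <amp.h>
    using namespace concurrency;

    void vadd(float* A, float* B, float* C, int n) {
      array_view<const float,1> AV(n,A), BV(n,B);  // read-only views of the inputs
      array_view<float,1> CV(n,C);                 // writable view of the output
      CV.discard_data();                           // old contents of C need not be copied in
      parallel_for_each(CV.extent, [=](index<1> i) restrict(amp) {
        CV[i] = AV[i] + BV[i];
      });
      CV.synchronize();                            // copy the result back to C
    }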

30 Thank You!

31 Numerical Stability. Algorithms that can always find an appropriate operation order, and thus find a solution to the problem as long as it exists for any given input values, are numerically stable. Algorithms that fall short are numerically unstable.

Portability, Scalability, and Numerical Stability in Accelerated Kernels

Portability, Scalability, and Numerical Stability in Accelerated Kernels Portility, Slility, nd Numril Stility in Alrtd Krnls John Strtton Dotorl Cndidt: Univrsity of Illinois t Urn-Chmpign Snior Arhitt: MultiorWr In Outlin Prformn Portility Wht CPU progrmmrs nd to lrn from

More information

Review: Binary Trees. CSCI 262 Data Structures. Search Trees. In Order Traversal. Binary Search Trees 4/10/2018. Review: Binary Tree Implementation

Review: Binary Trees. CSCI 262 Data Structures. Search Trees. In Order Traversal. Binary Search Trees 4/10/2018. Review: Binary Tree Implementation Rviw: Binry Trs CSCI 262 Dt Struturs 21 Binry Srh Trs A inry tr is in rursivly: = or A inry tr is (mpty) root no with lt hil n riht hil, h o whih is inry tr. Rviw: Binry Tr Implmnttion Just ollow th rursiv

More information

A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous computing

A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous computing A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous computing Wen-mei W. Hwu with Tom Jablin, Chris Rodrigues, Liwen Chang, Steven ShengZhou Wu, Abdul

More information

MERGE-BASED SpMV PERFECT WORKLOAD BALANCE. GUARANTEED. Duane Merrill, NVIDIA Research

MERGE-BASED SpMV PERFECT WORKLOAD BALANCE. GUARANTEED. Duane Merrill, NVIDIA Research MERGE-BASED SpMV PERFECT WORKLOAD BALANCE. GUARANTEED. Dun Mrrill, NVIDIA Rsr SPARSE MATRIX-VECTOR MULTIPLICATION SpMV (Ax = y) -- -- -- -- -- -- -- -- * = 2.0 0.0 2.0 4.0 sprs mtrix A ns vtor x ns vtor

More information

VAT GX - IP VIDEO FIELD ADD-ON/RETROFIT SPECIFICATIONS PERSPECTIVE SIDE VIEW MOUNTING CONNECTIONS CALL SINGLE CHANNEL ENCODER

VAT GX - IP VIDEO FIELD ADD-ON/RETROFIT SPECIFICATIONS PERSPECTIVE SIDE VIEW MOUNTING CONNECTIONS CALL SINGLE CHANNEL ENCODER VT GX - IP VIDO ILD DD-ON/RTROIT L00 LL -800-999-600 SINGL HNNL NODR SPIITIONS NTWORK ONNTORS SING: MTRIL: P + BS X7240 OLOR: DRK BLU DIMNSIONS IN MILLIMTRS (DIMNSIONS IN INHS) SUSTINBILITY: MMORY: PV

More information

VAT GX - IP VIDEO FIELD ADD-ON/RETROFIT SINGLE CHANNEL ENCODER

VAT GX - IP VIDEO FIELD ADD-ON/RETROFIT SINGLE CHANNEL ENCODER L00 LL -800-999-600 VT GX - IP VIDO ILD DD-ON/RTROIT SINGL HNNL NODR NTWORK ONNTORS SING: MTRIL: P + BS X7240 OLOR: DRK BLU SUSTINBILITY: MMORY: PV R 256 MB RM, 256 MB LSH BTTRY BKD- RL TIM LOK POWR: POWR

More information

A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu University of Illinois

A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu University of Illinois A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs Li-Wen Chang, Wen-mei Hwu University of Illinois A Scalable, Numerically Stable, High- How to Build a gtsv for Performance

More information

History Rgistr Allotion Exmpl As ol s intrmit o Consir this progrm with six vrils: := + := + := - 1 Us in th originl FORTRAN ompilr (1950 s) Vry ru lg

History Rgistr Allotion Exmpl As ol s intrmit o Consir this progrm with six vrils: := + := + := - 1 Us in th originl FORTRAN ompilr (1950 s) Vry ru lg Th Mmory Hirrhy Avn Compilrs CMPSCI 710 Spring 2003 Highr = smllr, str, losr to CPU A rl sktop mhin (min) Rgistr Allotion Emry Brgr rgistrs 8 intgr, 8 loting-point; 1-yl ltny L1 h 8K t & instrutions; 2-yl

More information

Lecture Outline. Memory Hierarchy Management. Register Allocation. Register Allocation. Lecture 38. Cache Management. Managing the Memory Hierarchy

Lecture Outline. Memory Hierarchy Management. Register Allocation. Register Allocation. Lecture 38. Cache Management. Managing the Memory Hierarchy Ltur Outlin Mmory Hirrhy Mngmnt Rgistr Allotion Ltu8 (rom nots y G. Nul n R. Boik) Rgistr Allotion Rgistr intrrn grph Grph oloring huristis Spilling Ch Mngmnt 4/27/08 Pro. Hilingr CS164 Ltu8 1 4/27/08

More information

Overview Linear Algebra Review Linear Algebra Review. What is a Matrix? Additional Resources. Basic Operations.

Overview Linear Algebra Review Linear Algebra Review. What is a Matrix? Additional Resources. Basic Operations. Oriw Ro Jnow Mon, Sptmr 2, 24 si mtri oprtions (, -, *) Cross n ot prouts Dtrminnts n inrss Homonous oorints Ortonorml sis itionl Rsours 8.6 Tt ook 6.837 Tt ook 6.837-stff@rpis.sil.mit.u Ck t ours wsit

More information

Lecture Outline. Memory Hierarchy Management. Register Allocation. Register Allocation. Lecture 19. Cache Management. The Memory Hierarchy

Lecture Outline. Memory Hierarchy Management. Register Allocation. Register Allocation. Lecture 19. Cache Management. The Memory Hierarchy Ltur Outlin Mmory Hirrhy Mngmnt Rgistr Allotion Ltur 19 Rgistr Allotion Rgistr intrrn grph Grph oloring huristis Spilling Ch Mngmnt Pro. Boik CS 164 Ltur 17 1 Pro. Boik CS 164 Ltur 17 2 Th Mmory Hirrhy

More information

WORKSHOP 2 Solid Shell Composites Modeling

WORKSHOP 2 Solid Shell Composites Modeling WORKSHOP 2 Soli Shll Composits Moling WS2-1 WS2-2 Workshop Ojtivs Bom fmilir with stting up soli omposit shll mol Softwr Vrsion Ptrn 2011 MD Nstrn 2011.1 Fils Rquir soli_shll. WS2-3 Prolm Dsription Simult

More information

Finding a Funicular Curve Through Two Points

Finding a Funicular Curve Through Two Points This is th glss pyrmi t th Louvr Musum in Pris, sign y rhitt I.M. Pi. It is support from nth y stl ls. In signing strutur suh s this, it is oftn most usful to slt l of rtin siz n tnsil strngth, n thn to

More information

By: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,

More information

Compiling a Parallel DSL to GPU

Compiling a Parallel DSL to GPU Compiling Prllel DSL to GPU Rmesh Nrynswmy Bdri Gopln Synopsys In. Synopsys 2012 1 Agend Overview of Verilog Simultion Prllel Verilog Simultion Algorithms Prllel Simultion Trdeoffs on GPU Chllenges Synopsys

More information

The Network Layer: Routing Algorithms. The Network Layer: Routing & Addressing Outline

The Network Layer: Routing Algorithms. The Network Layer: Routing & Addressing Outline PS 6 Ntwork Programming Th Ntwork Layr: Routing lgorithms Michl Wigl partmnt of omputr Scinc lmson Univrsity mwigl@cs.clmson.du http://www.cs.clmson.du/~mwigl/courss/cpsc6 Th Ntwork Layr: Routing & ddrssing

More information

CPSC 826 Internetworking. The Network Layer: Routing & Addressing Outline. The Network Layer: Routing Algorithms. Routing Algorithms Taxonomy

CPSC 826 Internetworking. The Network Layer: Routing & Addressing Outline. The Network Layer: Routing Algorithms. Routing Algorithms Taxonomy PS Intrntworking Th Ntwork Layr: Routing & ddrssing Outlin Th Ntwork Layr: Routing lgorithms Michl Wigl partmnt of omputr Scinc lmson Univrsity mwigl@cs.clmson.du Novmbr, Ntwork layr functions Routr architctur

More information

Moving Towards Exascale with Lessons Learned from GPU Computing. Wen-mei Hwu ECE, CS, PCI, NCSA University of Illinois at Urbana-Champaign

Moving Towards Exascale with Lessons Learned from GPU Computing. Wen-mei Hwu ECE, CS, PCI, NCSA University of Illinois at Urbana-Champaign Moving Towards Exascale with Lessons Learned from GPU Computing Wen-mei Hwu ECE, CS, PCI, NCSA University of Illinois at Urana-Champaign Agenda Blue Waters and recent progress in petascale GPU computing

More information

cisc1110 fall 2010 lecture VI.2 call by value function parameters another call by value example:

cisc1110 fall 2010 lecture VI.2 call by value function parameters another call by value example: cisc1110 fll 2010 lecture VI.2 cll y vlue function prmeters more on functions more on cll y vlue nd cll y reference pssing strings to functions returning strings from functions vrile scope glol vriles

More information

CS 331: Artificial Intelligence Bayesian Networks (Inference) Inference

CS 331: Artificial Intelligence Bayesian Networks (Inference) Inference S 331: rtificil Intllignc ysin Ntworks Infrnc 1 Infrnc Suppos you r givn ysin ntwork with th grph structur n th prmtrs ll figur out Now you woul lik to us it to o infrnc You n infrnc to mk prictions or

More information

Global Register Allocation

Global Register Allocation Ltur Outlin Glol Rgistr Allotion Mmory Hirrhy Mngmnt Rgistr Allotion vi Grph Coloring Rgistr intrrn grph Grph oloring huristis Spilling Ch Mngmnt 2 Th Mmory Hirrhy Rgistrs 1 yl 256-8000 yts Ch 3 yls 256k-16M

More information

Internet Routing. IP Packet Format. IP Fragmentation & Reassembly. Principles of Internet Routing. Computer Networks 9/29/2014.

Internet Routing. IP Packet Format. IP Fragmentation & Reassembly. Principles of Internet Routing. Computer Networks 9/29/2014. omputer Networks 9/29/2014 IP Pket Formt Internet Routing Ki Shen IP protool version numer heder length (words) for qulity of servie mx numer remining hops (deremented t eh router) upper lyer protool to

More information

Rethinking Computer Architecture for Energy Limited Computing

Rethinking Computer Architecture for Energy Limited Computing Rethinking Computer Architecture for Energy Limited Computing Wen-mei Hwu ECE, CS, PCI, NCSA University of Illinois at Urana-Champaign Agenda Blue Waters and recent progress in petascale GPU computing

More information

Enterprise Digital Signage Create a New Sign

Enterprise Digital Signage Create a New Sign Enterprise Digitl Signge Crete New Sign Intended Audiene: Content dministrtors of Enterprise Digitl Signge inluding stff with remote ess to sign.pitt.edu nd the Content Mnger softwre pplition for their

More information

Solution of Linear Algebraic Equations using the Gauss-Jordan Method

Solution of Linear Algebraic Equations using the Gauss-Jordan Method Solution of Liner Algebric Equtions using the Guss-Jordn Method Populr pproch for solving liner equtions The Guss Jordn method depends on two properties of liner equtions: Scling one or more of ny of the

More information

CMPUT101 Introduction to Computing - Summer 2002

CMPUT101 Introduction to Computing - Summer 2002 CMPUT Introdution to Computing - Summer 22 %XLOGLQJ&RPSXWHU&LUFXLWV Chpter 4.4 3XUSRVH We hve looked t so fr how to uild logi gtes from trnsistors. Next we will look t how to uild iruits from logi gtes,

More information

Paradigm 5. Data Structure. Suffix trees. What is a suffix tree? Suffix tree. Simple applications. Simple applications. Algorithms

Paradigm 5. Data Structure. Suffix trees. What is a suffix tree? Suffix tree. Simple applications. Simple applications. Algorithms Prdigm. Dt Struture Known exmples: link tble, hep, Our leture: suffix tree Will involve mortize method tht will be stressed shortly in this ourse Suffix trees Wht is suffix tree? Simple pplitions History

More information

COSC 6374 Parallel Computation. Non-blocking Collective Operations. Edgar Gabriel Fall Overview

COSC 6374 Parallel Computation. Non-blocking Collective Operations. Edgar Gabriel Fall Overview COSC 6374 Prllel Computtion Non-loking Colletive Opertions Edgr Griel Fll 2014 Overview Impt of olletive ommunition opertions Impt of ommunition osts on Speedup Crtesin stenil ommunition All-to-ll ommunition

More information

This module calculates the motor speed based on a rotor position measurement when the direction information is available.

This module calculates the motor speed based on a rotor position measurement when the direction information is available. SPEED_FRQ Spd Calulator Basd on Rotor Angl With Dirtion Information Dsription This modul alulats th motor spd basd on a rotor position masurmnt whn th dirtion information is availabl. thta_l dir_qep SPEED_FRQ

More information

ASSIGNMENT 9: CACHE MEMORY NAME. Assume we are building a cache for a memory system that s just 16 bytes big 4 address bits.

ASSIGNMENT 9: CACHE MEMORY NAME. Assume we are building a cache for a memory system that s just 16 bytes big 4 address bits. . SSIGNMNT : H MMORY NM PROLM : -YT H OR -YT MMORY. ssume we are building a cache for a memory system that s just bytes big address bits. We will make a direct mapped cache that has four set, so there

More information

In order to learn which questions have been answered correctly: 1. Print these pages. 2. Answer the questions.

In order to learn which questions have been answered correctly: 1. Print these pages. 2. Answer the questions. XML: Mnging with th Jv Pltform In ordr to lrn whih qustions hv n nswrd orrtly: 1. Print ths pgs. 2. Answr th qustions. 3. Snd this ssssmnt with th nswrs vi:. FAX to (212) 967-3498. Or. Mil th nswrs to

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information

Lists in Lisp and Scheme

Lists in Lisp and Scheme Lists in Lisp nd Scheme Lists in Lisp nd Scheme Lists re Lisp s fundmentl dt structures, ut there re others Arrys, chrcters, strings, etc. Common Lisp hs moved on from eing merely LISt Processor However,

More information

COSC 6374 Parallel Computation. Dense Matrix Operations

COSC 6374 Parallel Computation. Dense Matrix Operations COSC 6374 Prllel Computtion Dense Mtrix Opertions Edgr Griel Fll Edgr Griel Prllel Computtion Edgr Griel erminology Dense Mtrix: ll elements of the mtrix ontin relevnt vlues ypilly stored s 2-D rry, (e.g.

More information

String comparison by transposition networks

String comparison by transposition networks String omprison y trnsposition networks Alexnder Tiskin (Joint work with Peter Krushe) Deprtment of Computer Siene University of Wrwik http://www.ds.wrwik..uk/~tiskin (inludes n extended version of this

More information

Preparing GPU-Accelerated Applications for the Summit Supercomputer

Preparing GPU-Accelerated Applications for the Summit Supercomputer Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership

More information

High Performance Matrix-matrix Multiplication of Very Small Matrices

High Performance Matrix-matrix Multiplication of Very Small Matrices High Performance Matrix-matrix Multiplication of Very Small Matrices Ian Masliah, Marc Baboulin, ICL people University Paris-Sud - LRI Sparse Days Cerfacs, Toulouse, 1/07/2016 Context Tensor Contractions

More information

Bigger GPUs and Bigger Nodes. Carl Pearson PhD Candidate, advised by Professor Wen-Mei Hwu

Bigger GPUs and Bigger Nodes. Carl Pearson PhD Candidate, advised by Professor Wen-Mei Hwu Bigger GPUs and Bigger Nodes Carl Pearson (pearson@illinois.edu) PhD Candidate, advised by Professor Wen-Mei Hwu 1 Outline Experiences from working with domain experts to develop GPU codes on Blue Waters

More information

Introduction to CUDA (1 of n*)

Introduction to CUDA (1 of n*) Administrivia Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Paper presentation due Wednesday, 02/23 Topics first come, first serve Assignment 4 handed today

More information

Unrolling parallel loops

Unrolling parallel loops Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:

More information

GPU Supercomputing From Blue Waters to Exascale

GPU Supercomputing From Blue Waters to Exascale GPU Supercomputing From Blue Waters to Exascale Wen-mei Hwu Professor, University of Illinois at Urbana-Champaign (UIUC) CTO, MulticoreWare Inc. New BW Configuration Cray System & Storage cabinets: Compute

More information

COSC 6374 Parallel Computation. Communication Performance Modeling (II) Edgar Gabriel Fall Overview. Impact of communication costs on Speedup

COSC 6374 Parallel Computation. Communication Performance Modeling (II) Edgar Gabriel Fall Overview. Impact of communication costs on Speedup COSC 6374 Prllel Computtion Communition Performne Modeling (II) Edgr Griel Fll 2015 Overview Impt of ommunition osts on Speedup Crtesin stenil ommunition All-to-ll ommunition Impt of olletive ommunition

More information

Why are GPUs so hard to program or are they? Wen-mei Hwu University of Illinois, Urbana-Champaign MulticoreWare

Why are GPUs so hard to program or are they? Wen-mei Hwu University of Illinois, Urbana-Champaign MulticoreWare Why are GPUs so hard to program or are they? Wen-mei Hwu University of Illinois, Urbana-Champaign MulticoreWare Agenda GPU Computing in Blue Waters Library Algorithms Scalability, performance, and numerical

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Containers: Queue and List

Containers: Queue and List Continers: Queue n List Queue A ontiner in whih insertion is one t one en (the til) n eletion is one t the other en (the he). Also lle FIFO (First-In, First-Out) Jori Cortell n Jori Petit Deprtment of

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

Scalability, Portability, and Productivity in GPU Computing

Scalability, Portability, and Productivity in GPU Computing Scalability, Portability, and Productivity in GPU Computing Wen-mei Hwu Sanders AMD Chair, ECE and CS University of Illinois, Urbana-Champaign CTO, MulticoreWare Agenda Performance experience in using

More information

HSHM-H110AX-5CPX HSHM-H105BX-5CPX TYPE B21, 105 SIGNAL CONTACTS HSHM-HXXXXXX-5CPX-XXXXX

HSHM-H110AX-5CPX HSHM-H105BX-5CPX TYPE B21, 105 SIGNAL CONTACTS HSHM-HXXXXXX-5CPX-XXXXX M TM HSHM PRSS-FIT HR, -ROW, HSHM SRIS FOR HIGH SP HR MTRI PPLITIONS * UP TO Gb/s T RTS * LOW ROSSTLK T HIGH FRQUNIS * / (SINGL-N/IFFRNTIL) IMPN * MOULR/SLL FORMT I -- * MT LINS PR INH * SHIPS WITH PROTTIV

More information

Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011

Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for

More information

OpenACC. Part 2. Ned Nedialkov. McMaster University Canada. CS/SE 4F03 March 2016

OpenACC. Part 2. Ned Nedialkov. McMaster University Canada. CS/SE 4F03 March 2016 OpenACC. Part 2 Ned Nedialkov McMaster University Canada CS/SE 4F03 March 2016 Outline parallel construct Gang loop Worker loop Vector loop kernels construct kernels vs. parallel Data directives c 2013

More information

Characteristics of Fault Simulation. Fault Simulation Techniques. Parallel Fault Simulation. Parallel Fault Simulation

Characteristics of Fault Simulation. Fault Simulation Techniques. Parallel Fault Simulation. Parallel Fault Simulation Chrtristis o Fult Simultion Fult tivity with rspt to ult-r iruit is otn sprs oth in tim n sp. For mpl F is not tivt y th givn pttrn, whil F2 ts only th lowr prt o this iruit. Fult Simultion Thniqus Prlll

More information

Preparing seismic codes for GPUs and other

Preparing seismic codes for GPUs and other Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of

More information

Introducing fractions

Introducing fractions Introduing frtions Nme Colour hlf of eh shpe: Show the following fr ons: out of out of out of Lel these fr ons: Shde these fr ons: 7 0 Represents ommon fr ons on different models Interprets the numertor

More information

Paralization on GPU using CUDA An Introduction

Paralization on GPU using CUDA An Introduction Paralization on GPU using CUDA An Introduction Ehsan Nedaaee Oskoee 1 1 Department of Physics IASBS IPM Grid and HPC workshop IV, 2011 Outline 1 Introduction to GPU 2 Introduction to CUDA Graphics Processing

More information

S675, S750 Stretchair Parts List A D

S675, S750 Stretchair Parts List A D Page of 7 F G PRTS LIST Number Part Number escription M675-038 STR, 6" NTRL LOKING, TOTL LK TWIN WHL M675-04 STR, 6" NTRL LOKING, IR LK, TWIN WHL S-HX-ZP-M6-0 SRW, HX H, M6 X 0 MM LG 4 W-LI-ZP-5-47-03

More information

Below, are instructions about how to set each goal and report achievements in Your Club, Service, and Foundation Giving.

Below, are instructions about how to set each goal and report achievements in Your Club, Service, and Foundation Giving. Rotry Clu Cntrl is n onlin tool to hlp lus st nd trk lu gols nd hivmnts. This rfrn guid outlins th stps you nd to tk to st nd dit gols s wll s rport hivmnts in Rotry Clu Cntrl. If ny dt is displyd inorrtly,

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

Chapter 2 A Guide for Implementing Tridiagonal Solvers on GPUs

Chapter 2 A Guide for Implementing Tridiagonal Solvers on GPUs Chapter 2 A Guide for Implementing Tridiagonal Solvers on GPUs Li-Wen Chang and Wen-mei W. Hwu 2.1 Introduction The tridiagonal solver has been recognized as a critical building block for many engineering

More information

ECO GUIDE TO Unstratified Samples

ECO GUIDE TO Unstratified Samples ECO GUIDE TO Unstrtifid Smpls Wht Is n Unstrtifid Smpld? If you hv didd to ondut smpl invntory, you will b ollting dt for plots lotd throughout your study r. In this typ of projt, you n hoos to strtify

More information

ISO VIEW COVER, EXPRESS EXIT 4X4 FLIP COVER OPEN VIEW EXPRESS EXIT ON TROUGH VIEW

ISO VIEW COVER, EXPRESS EXIT 4X4 FLIP COVER OPEN VIEW EXPRESS EXIT ON TROUGH VIEW RV MO WN T 00899MO OVL 07-JN-5 0078MO HUH 7-SP-5 5.90 RF.87 RF.000 RF ISO VIW SL OVR, 0.50 OVR XTNSION X FLIP OVR FGS-MX-- (NOT INLU IN KIT).07 RF 7.7 RF FLIP OVR OPN VIW SL X STRIGHT STION RF RKT, XPRSS

More information

GPU CUDA Programming

GPU CUDA Programming GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications

More information

Programming in CUDA. Malik M Khan

Programming in CUDA. Malik M Khan Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement

More information

Mattan Erez. The University of Texas at Austin

Mattan Erez. The University of Texas at Austin EE382V (17325): Principles in Computer Architecture Parallelism and Locality Fall 2007 Lecture 12 GPU Architecture (NVIDIA G80) Mattan Erez The University of Texas at Austin Outline 3D graphics recap and

More information

LINX MATRIX SWITCHERS FIRMWARE UPDATE INSTRUCTIONS FIRMWARE VERSION

LINX MATRIX SWITCHERS FIRMWARE UPDATE INSTRUCTIONS FIRMWARE VERSION Overview LINX MATRIX SWITCHERS FIRMWARE UPDATE INSTRUCTIONS FIRMWARE VERSION 4.4.1.0 Due to the omplex nture of this updte, plese fmilirize yourself with these instrutions nd then ontt RGB Spetrum Tehnil

More information

CSE P 501 Compilers. Register Allocation Hal Perkins Spring UW CSE P 501 Spring 2018 P-1

CSE P 501 Compilers. Register Allocation Hal Perkins Spring UW CSE P 501 Spring 2018 P-1 CSE P 501 Compilrs Rgistr Allotion Hl Prkins Spring 2018 UW CSE P 501 Spring 2018 P-1 Agn Rgistr llotion onstrints Lol mthos Fstr ompil, slowr o, ut goo nough or lots o things (JITs, ) Glol llotion rgistr

More information

S4289: Efficient solution of multiple scalar and block-tridiagonal equations

S4289: Efficient solution of multiple scalar and block-tridiagonal equations S4289: Efficient solution of multiple scalar and block-tridiagonal equations Endre László endre.laszlo [at] oerc.ox.ac.uk Oxford e-research Centre, University of Oxford, UK Pázmány Péter Catholic University,

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
