CS 668 Parallel Computing Spring 2011


1 CS 668 Parallel Computing Spring 2011
Prof. Fred
Office Hours: 11-1 MW or by appointment
Tel:
Meeting: TuTh 2:00-3:25 in RecCenter 3240

2 Lecture 1: Welcome
- Goals of this course
- Syllabus, policies, grading
- Blackboard Resources
- Introduction/Motivation
- Scope of the Problems in Parallel Computing
- Massive Parallelism on the Desktop

3 Learning Outcomes: Goals Students will learn the computational thinking and programming skills needed to achieve terascale computing performance in all science and engineering disciplines. Students will learn algorithmic design patterns for parallel computing, critical system and architectural design issues, and programming methods and analysis for parallel computing software.

4 Workload/Grading
Grading will be based on lab and homework assignments and on a final project presentation and report. The final grade will be a weighted average: Homeworks 30%, Labs 30%, Final Presentation 20%, Final Report 20%. Late homework is accepted but will negatively affect your grade. There will be no exams.
Final Project Information: Students will be expected to work in small teams on a final project. The work for the project can be divided among team members in any way that works best for the group. All individual work and outcomes must be clearly documented and self-assessed.
Project Status Levels: Green, Gold, Platinum

5 Course Materials
- Textbooks are optional
  - Programming Massively Parallel Processors by Hwu and Kirk
  - Parallel Programming in C with MPI and OpenMP by Michael J. Quinn
- Notes will be made available on BB
- Audio/Video Podcast recordings should also be available starting next week
- Extensive use of Forums on BB is encouraged
  - Status badges for best Q&A
- Lab Equipment
  - Your own PCs running G80 emulators
  - Your own Mac or PCs with a CUDA-enabled GPU

6 Topics: Tentative Schedule
Week 1: Tue: Lect 1 - Introduction to Parallel Computing; Thu: Lect 2 - Parallel Systems and Architectures
Week 2: Tue: Lect 3 - GPU Computing and CUDA Introduction; Thu: Lect 4 - CUDA threading model
Week 3: Tue: Lect 5 - CUDA memory model; Thu: Lect 6 - CUDA memory tiling
Week 4: Tue: Lect 7 - Parallel Design Patterns; Thu: Lect 8 - Scalability Analysis
Week 5: Tue: Lect 9 - Parallel Sieve; Thu: Lect 10 - Parallel Reductions
Week 6: Tue: Lect 11 - Parallel Dynamic Programming; Thu: Lect 12 - Parallel FFT
Week 7: Tue: Lect 13 - Monte Carlo Methods; Thu: Lect 14 - Parallel Machine Learning
Week 8: Tue: Lect 15 - Alternative Architectures; Thu: Lect 16 - Alternative Methodologies
Weeks 9-10: Final Project Demonstrations

7 Policies Academic Honesty: Plagiarism on assignments will not be tolerated. See your student code of conduct for more on the consequences of academic misconduct. There are no small offenses.

8 What is the Ralph Regula School? The Ralph Regula School of Computational Science is a statewide, virtual school focused on computational science. It is a collaborative effort of the Ohio Board of Regents, Ohio Supercomputer Center, Ohio Learning Network and Ohio's colleges and universities. With funding from NSF, the school acts as a coordinating entity for a variety of computational science education activities aimed at making education in computational science available to students across Ohio, as well as to workers seeking continuing education about these technologies. Website:

9 RRSCS Competencies
This course meets the competencies for Area 6 for the Minor Program in Computational Science of the Ralph Regula School of Computational Science. The following formal competencies are addressed in this course:
1. Describe the fundamental concepts of parallel programming and related systems and architectures.
2. Demonstrate parallel programming concepts using modern computational platforms, such as CUDA, OpenCL, MPI, and OpenMP.
3. Demonstrate knowledge of parallel scalability analysis.
4. Demonstrate knowledge of parallel programming libraries.

10 Intro to Parallel Computing
Parallel computing (PC) is needed by people who solve science and engineering problems through physical modeling:
- Materials / Superconductivity
- Fluid Flow
- Weather/Climate
- Structural Deformation
- Genetics / Protein interactions
- Seismic
Many research projects in the natural sciences and engineering cannot exist without parallel computing.

11 Applications Videos Applications in Physics and Geology Simulation of Large-Scale Structure of Universe Stability Simulation

12 Why are S&E problems so large?
- 3-Dimensional
  - If you want to increase the level of resolution by a factor of 10, the problem size increases by 10^3
- Many Length Scales (both time and space)
  - If you want to observe the interactions between very small local phenomena and larger, more global phenomena
- The number of relationships between data items grows quadratically
  - Example: human genome, 3.2 G base pairs means about 5,000,000,000,000,000,000 = 5E18 relations
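The two scaling claims on this slide can be checked with a few lines of arithmetic. The sketch below is in Python, which is an assumption of this example rather than a course language, and the function names are hypothetical. It also assumes that "relations" means unordered pairs of data items, n(n-1)/2, which is one reading that reproduces the roughly 5E18 figure.

```python
def grid_points(points_per_dim: int, dims: int = 3) -> int:
    """Number of cells in a regular grid with the same resolution in every dimension."""
    return points_per_dim ** dims


def pairwise_relations(n: int) -> int:
    """Number of unordered pairs among n items (assumed meaning of 'relations')."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Refining a 3-D grid from 100 to 1000 points per dimension (10x resolution).
    growth = grid_points(1000) // grid_points(100)
    print("3-D problem size growth for 10x resolution:", growth)

    # Human genome example from the slide: 3.2 G base pairs.
    base_pairs = 3_200_000_000
    print(f"pairwise relations: {pairwise_relations(base_pairs):.2e}")
```

The pair count is about n^2/2, so doubling the number of data items roughly quadruples the number of relations.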

13 How to solve these large problems?
- The only way is to take advantage of parallelism
- Large problems generally have many operations which can be performed concurrently
- Parallelism can be exploited at many levels by computer hardware
  - Within the CPU core: multiple functional units, pipelining
  - Within the chip: many cores
  - On a node: multiple chips
  - In a system: many nodes
  - On the grid: many systems
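As a small illustration of "many operations which can be performed concurrently", the sketch below splits one large sum into independent chunks and combines the partial results. It is written in Python with the standard-library thread pool, which is an assumption of this example; the function names are hypothetical. Standard CPython threads do not speed up pure-Python arithmetic, so this shows only the decomposition pattern, not a performance result.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_of_squares(chunk: range) -> int:
    """Work on one independent piece of the problem."""
    return sum(i * i for i in chunk)


def parallel_sum_of_squares(n: int, workers: int = 4) -> int:
    """Sum i*i for i in [0, n) by splitting the range into independent chunks."""
    if n <= 0:
        return 0
    step = max(1, (n + workers - 1) // workers)  # ceiling division
    chunks = [range(start, min(start + step, n)) for start in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk needs no data from any other chunk, so they can run concurrently.
        partial_results = pool.map(partial_sum_of_squares, chunks)
        # Combining the partial results is the only step that needs all of them.
        return sum(partial_results)


if __name__ == "__main__":
    print(parallel_sum_of_squares(1_000_000))
```

The same split-compute-combine shape applies at each hardware level listed on the slide; only the cost of distributing chunks and gathering results changes.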

14 However, Parallelism Has Overheads
- At the core and chip level, the cost is complexity and money
  - Most applications get only a fraction of peak performance (10%-20%)
- At the chip and node level, the memory bus can get saturated if there are too many cores
- Between nodes, the communication infrastructure is typically much slower than the datapath within the CPU
- On the grid, reliability and security are the main issues

15 Necessity Yields Modest Success

16 Categories of Parallelism

17 Why not just use a Compiler?

18 Can we Extend Existing Languages?

19 Or Create New Parallel Languages?

20 Massively Parallel Processing on Desktop
- A quiet revolution and potential build-up
  - Calculation: TFLOPS vs. 100 GFLOPS
  - Memory Bandwidth: ~10x
- GPU in every PC: massive volume and potential impact
- 3/2011: AMD Radeon HD TFLOPS. Price: $
[Chart: many-core GPU vs. multi-core CPU. Courtesy: John Owens]

21 GeForce 8800 (2007)
16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s Mem BW, 4 GB/s BW to CPU
[Block diagram: Host, Input Assembler, Thread Execution Manager, Parallel Data Cache and Texture units, Load/store units, Global Memory]
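The peak compute and memory bandwidth figures above imply how much arithmetic a kernel must do per value loaded before it stops being limited by memory. The sketch below works that out in Python (a language choice assumed for this example). The simple "minimum of compute limit and bandwidth limit" model and the function names are assumptions of this sketch, not something stated on the slide.

```python
PEAK_GFLOPS = 367.0      # GeForce 8800 peak, from the slide
MEM_BW_GB_S = 86.4       # GeForce 8800 memory bandwidth, from the slide
BYTES_PER_FLOAT = 4      # single-precision value


def flops_per_byte(peak_gflops: float, bandwidth_gb_s: float) -> float:
    """Floating-point operations needed per byte of memory traffic to reach peak."""
    return peak_gflops / bandwidth_gb_s


def attainable_gflops(flops_per_float_loaded: float,
                      peak_gflops: float = PEAK_GFLOPS,
                      bandwidth_gb_s: float = MEM_BW_GB_S) -> float:
    """Upper bound on throughput under a simple compute-or-bandwidth-bound model."""
    floats_per_second_g = bandwidth_gb_s / BYTES_PER_FLOAT
    return min(peak_gflops, floats_per_second_g * flops_per_float_loaded)


if __name__ == "__main__":
    balance = flops_per_byte(PEAK_GFLOPS, MEM_BW_GB_S)
    print(f"FLOPs per byte to reach peak:  {balance:.2f}")
    print(f"FLOPs per float to reach peak: {balance * BYTES_PER_FLOAT:.1f}")
    print(f"Bound with 1 FLOP per float loaded: {attainable_gflops(1.0):.1f} GFLOPS")
```

Under this model, a kernel that performs one operation per float fetched from global memory is capped far below peak, which is one motivation for the memory tiling topic in the schedule.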

22 Fermi (2010)
~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
230 GB/s DRAM Bandwidth
David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, Spring 2010, University of Illinois, Urbana-Champaign

23 Future Apps Reflect a Concurrent World
- Exciting applications in the future mass computing market have traditionally been considered supercomputing applications
  - Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
  - These super-apps represent and model the physical, concurrent world
- Various granularities of parallelism exist, but
  - the programming model must not hinder parallel implementation
  - data delivery needs careful management

24 Stretching Traditional Architectures
- Traditional parallel architectures cover some super-applications
  - DSP, GPU, network apps, scientific
- The game is to grow mainstream architectures out or domain-specific architectures in
  - CUDA is the latter
[Figure labels: Traditional applications, Current architecture coverage, New applications, Domain-specific architecture coverage, Obstacles]

25 Previous Projects
Application | Description | Source | Kernel | % time
H.264 | SPEC '06 version, change in guess vector | 34, | | %
LBM | SPEC '06 version, change to single precision and print fewer reports | 1, | | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1, | | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1, | | %
RPES | Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion | 1, | | %
PNS | Petri Net simulation of a distributed system | | | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack's Gaussian elim. routine | | | >99%
TPACF | Two Point Angular Correlation Function | | | %
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1, | | %
MRI-Q | Computing a matrix Q, a scanner's configuration in MRI reconstruction | | | >99%
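SAXPY, one of the applications listed above, computes y = a*x + y element by element. A minimal reference sketch in Python follows (the language is an assumption of this example, and it is not the benchmark code used in the projects). Python floats are double precision, so the "single-precision" part of the name is not modeled here.

```python
from typing import List


def saxpy(a: float, x: List[float], y: List[float]) -> List[float]:
    """Return a*x + y computed element by element.

    Every output element depends only on the matching elements of x and y,
    so on a GPU each element could be handled by its own thread.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


if __name__ == "__main__":
    print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

The kernel does two floating-point operations per pair of values loaded, so its performance is usually governed by memory bandwidth rather than arithmetic throughput.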

26 Speedup of Applications
[Chart: GPU speedup relative to CPU, kernel and application, for H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, MRI-FHD]
GeForce 8800 GTX vs. 2.2GHz Opteron
- speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
- 25x to 400x speedup if the function's data requirements and control flow suit the GPU and the application is optimized
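Kernel speedup and whole-application speedup differ because only the kernel's share of the run time is accelerated. One common way to relate them is Amdahl's law, which the slide does not name; it is used here as an assumption, and the fractions and speedups in the sketch are hypothetical values chosen for illustration, not measurements from these projects.

```python
def application_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the run time is accelerated.

    kernel_fraction: share of the original run time spent in the kernel (0..1)
    kernel_speedup:  how many times faster the kernel runs after porting
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


if __name__ == "__main__":
    # Hypothetical cases: the same 100x kernel speedup, different kernel shares.
    for fraction in (0.35, 0.99):
        overall = application_speedup(fraction, 100.0)
        print(f"kernel share {fraction:.0%}: application speedup {overall:.2f}x")
```

Under this model an application whose kernel takes a small share of the run time gains little overall, however fast the kernel becomes.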

Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX
Wen-mei Hwu with David Kirk, Shane Ryoo, Christopher Rodrigues, John Stratton, Kuangwei Huang
Overview