CS4961 Parallel Programming. Lecture 1: Introduction. Mary Hall, August 25, 2009.


Course Details
- Time and Location: TuTh, 9:10-10:30 AM, WEB L112
- Course Website: http://www.eng.utah.edu/~cs4961/
- Instructor: Mary Hall, mhall@cs.utah.edu, http://www.cs.utah.edu/~mhall/
  - Office Hours: Tu 10:45-11:15 AM; Wed 11:00-11:30 AM
- TA: Sriram Aananthakrishnan, sriram@cs.utah.edu
  - Office Hours: TBD
- Textbook: Principles of Parallel Programming, Calvin Lin and Lawrence Snyder
  - Also, readings and notes provided for MPI, CUDA, locality, and parallel algorithms

Today's Lecture
- Overview of course (done)
- Important problems require powerful computers
  - and powerful computers must be parallel
  - Increasing importance of educating parallel programmers (you!)
- What sorts of architectures are covered in this class
  - Multimedia extensions, multi-cores, GPUs, networked clusters
- Developing high-performance parallel applications
  - An optimization perspective

Outline
- Logistics
- Introduction
- Technology drivers for the multi-core paradigm shift
- Origins of parallel programming: large-scale scientific simulations
- The fastest computer in the world today
- Why writing fast parallel programs is hard

Some material for this lecture drawn from:
- Kathy Yelick and Jim Demmel, UC Berkeley
- Quentin Stout, University of Michigan (see http://www.eecs.umich.edu/~qstout/parallel.html)
- Top 500 list (http://www.top500.org)

Course Objectives
- Learn how to program parallel processors and systems
  - Learn how to think in parallel and write correct parallel programs
  - Achieve performance and scalability through understanding of architecture and software mapping
- Significant hands-on programming experience
  - Develop real applications on real hardware
- Discuss the current parallel computing context
  - What are the drivers that make this course timely?
  - Contemporary programming models and architectures, and where the field is going

Parallel and Distributed Computing
- Parallel computing (processing): the use of two or more processors (computers), usually within a single system, working simultaneously to solve a single problem.
- Distributed computing (processing): any computing that involves multiple computers remote from each other, each of which has a role in a computation problem or information processing.
- Parallel programming: the human process of developing programs that express what computations should be executed in parallel.

Why is Parallel Programming Important Now?
- All computers are now parallel computers (embedded, commodity, supercomputer)
  - On-chip architectures look like parallel computers
  - Languages, software development, and compilation strategies originally developed for the high end (supercomputers) are now becoming important for many other domains
- Why? Technology trends
- Looking to the future
  - Parallel computing for the masses demands better parallel programming paradigms
  - And more people who are trained in writing parallel programs (possibly you!)
  - How to put all these vast machine resources to the best use!

Detour: Technology as Driver for Multi-Core Paradigm Shift
- Do you know why most computers sold today are parallel computers?
- Let's talk about the technology trends.

Technology Trends: Microprocessor Capacity
- Transistor count still rising
- Moore's Law: Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

Technology Trends: Power Density Limits Serial Performance
- Clock speed flattening sharply
- (Slide source: Maurice Herlihy)

The Multi-Core Paradigm Shift
- What to do with all these transistors?
- Key ideas:
  - Movement away from increasingly complex processor design and faster clocks
  - Replicated functionality (i.e., parallel) is simpler to design
  - Resources more efficiently utilized
  - Huge power management advantages

Proof of Significance: Popular Press
- This week's issue of Newsweek! Article on 25 things smart people should know
- See http://www.newsweek.com/id/212142
- "All Computers are Parallel Computers."

Scientific Simulation: The Third Pillar of Science
- Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build a system.
- Limitations:
  - Too difficult -- build large wind tunnels.
  - Too expensive -- build a throw-away passenger jet.
  - Too slow -- wait for climate or galactic evolution.
  - Too dangerous -- weapons, drug design, climate experimentation.
- Computational science paradigm:
  3) Use high performance computer systems to simulate the phenomenon
  - Based on known physical laws and efficient numerical methods.

The Quest for Increasingly More Powerful Machines
- Scientific simulation will continue to push on system requirements:
  - To increase the precision of the result
  - To get to an answer sooner (e.g., climate modeling, disaster modeling)
- The U.S. will continue to acquire systems of increasing scale
  - For the above reasons
  - And to maintain competitiveness

A Similar Phenomenon in Commodity Systems
- More capabilities in software
- Integration across software
- Faster response
- More realistic graphics

The Fastest Computer in the World Today
- What is its name? RoadRunner
- Where is it located? Los Alamos National Laboratory
- How many processors does it have? ~19,000 processor chips (~129,600 processors)
- What kind of processors? AMD Opterons and IBM Cell/BE (in PlayStations)
- How fast is it? 1.105 petaflop/s, i.e., over one quadrillion (10^15) operations per second
- See http://www.top500.org

Example: Global Climate Modeling Problem
- Problem is to compute: f(latitude, longitude, elevation, time) -> temperature, pressure, humidity, wind velocity
- Approach:
  - Discretize the domain, e.g., a measurement point every 10 km
  - Devise an algorithm to predict weather at time t+δt given time t
- Uses:
  - Predict major events, e.g., El Niño
  - Use in setting air emissions standards
- (Figure: High Resolution Climate Modeling on NERSC-3, P. Duffy et al., LLNL)
- Source: http://www.epm.ornl.gov/chammp/chammp.html

Some Characteristics of Scientific Simulation
- Discretize physical or conceptual space into a grid
  - Simpler if regular, may be more representative if adaptive
- Perform local computations on the grid
  - Given yesterday's temperature and weather pattern, what is today's expected temperature?
- Communicate partial results between grids
  - Contribute the local weather result to understand the global weather pattern.
- Repeat for a set of time steps
- Possibly perform other calculations with results
  - Given the weather model, what area should evacuate for a hurricane?

Example of Discretizing a Domain
- One processor computes this part; another processor computes its part in parallel.
- Processors in adjacent blocks in the grid communicate their results.
- (A small sketch of this grid-computation pattern appears below.)
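To make the grid pattern above concrete, here is a minimal, hypothetical C sketch (not from the lecture slides) of a single-processor version: discretize a domain into a grid, update each point from its neighbors, and repeat over time steps. In the parallel setting each processor would own a block of the grid and exchange boundary values with its neighbors, as the slide describes. The grid size, step count, and averaging stencil are all illustrative choices.

/* Minimal illustrative stencil computation: local updates on a grid,
 * repeated for a set of time steps. Boundary rows/columns stay fixed. */
#include <stdio.h>

#define N 8          /* grid dimension (illustrative) */
#define STEPS 4      /* number of time steps (illustrative) */

int main(void) {
    double t[N][N], t_next[N][N];

    /* Initialize the grid: a "hot" boundary along row 0, zero elsewhere. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            t[i][j] = (i == 0) ? 100.0 : 0.0;

    for (int step = 0; step < STEPS; step++) {
        /* Local computation: each interior point is updated from its
         * four neighbors (a simple averaging stencil). */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                t_next[i][j] = 0.25 * (t[i-1][j] + t[i+1][j] +
                                       t[i][j-1] + t[i][j+1]);

        /* Copy interior results back for the next time step. */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                t[i][j] = t_next[i][j];
    }

    printf("t[1][1] after %d steps: %f\n", STEPS, t[1][1]);
    return 0;
}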

Parallel Programming Complexity: An Analogy to Preparing Thanksgiving Dinner
- Enough parallelism? (Amdahl's Law)
  - Suppose you want to just serve turkey
- Granularity
  - How frequently must each assistant report to the chef?
  - After each stroke of a knife? Each step of a recipe? Each dish completed?
- Locality
  - Grab the spices one at a time? Or collect the ones that are needed prior to starting a dish?
- Load balance
  - Each assistant gets a dish? Preparing stuffing vs. cooking green beans?
- Coordination and Synchronization
  - The person chopping onions for stuffing can also supply green beans
  - Start the pie after the turkey is out of the oven
All of these things make parallel programming even harder than sequential programming.

Finding Enough Parallelism
- Suppose only part of an application seems parallel
- Amdahl's Law:
  - Let s be the fraction of work done sequentially, so (1-s) is the fraction that is parallelizable
  - P = number of processors
  - Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) <= 1/s
- Even if the parallel part speeds up perfectly, performance is limited by the sequential part
- (A small numerical sketch of this bound appears after this page.)

Overhead of Parallelism
- Given enough parallel work, this is the biggest barrier to getting the desired speedup
- Parallelism overheads include:
  - cost of starting a thread or process
  - cost of communicating shared data
  - cost of synchronizing
  - extra (redundant) computation
- Each of these can be in the range of milliseconds (= millions of flops) on some systems
- Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work

Locality and Parallelism
- Conventional storage hierarchy: each processor has its own cache, L2 cache, L3 cache, and memory, with potential interconnects between them
- Large memories are slow; fast memories are small
- A program should do most of its work on local data
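As a small worked example of the Amdahl's Law bound above, the following C sketch (not part of the original slides) tabulates the limit 1/(s + (1-s)/P) for a few illustrative sequential fractions s and processor counts P. It shows, for instance, that with s = 0.05 the speedup can never exceed 1/s = 20 no matter how many processors are used.

/* Evaluate the Amdahl's Law upper bound on speedup for sample values of
 * the sequential fraction s and processor count P (values are illustrative). */
#include <stdio.h>

static double amdahl_speedup(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void) {
    double fractions[] = {0.05, 0.10, 0.25};   /* example sequential fractions */
    int procs[] = {4, 16, 64, 1024};           /* example processor counts */

    for (int i = 0; i < 3; i++) {
        double s = fractions[i];
        printf("s = %.2f (upper limit 1/s = %.1f):\n", s, 1.0 / s);
        for (int j = 0; j < 4; j++)
            printf("  P = %4d  speedup <= %6.2f\n",
                   procs[j], amdahl_speedup(s, procs[j]));
    }
    return 0;
}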

Load Imbalance
- Load imbalance is the time that some processors in the system are idle due to
  - insufficient parallelism (during that phase)
  - unequal-size tasks
- Examples of the latter:
  - adapting to interesting parts of a domain
  - tree-structured computations
  - fundamentally unstructured problems
- The algorithm needs to balance the load

Some Popular Parallel Programming Models
- Pthreads (parallel threads)
  - Low-level expression of threads, which are independent computations that can execute in parallel
- MPI (Message Passing Interface)
  - Most widely used on the very high-end machines
  - Extension to common sequential languages; expresses communication between different processes along with parallelism
- Map-Reduce (popularized by Google)
  - Map: apply the same computation to lots of different data (usually in distributed files) and produce local results
  - Reduce: compute a global result from the set of local results
- CUDA (Compute Unified Device Architecture)
  - Proprietary programming language for NVIDIA graphics processors
- (A tiny Pthreads sketch illustrating the first model appears after this page.)

Summary of Lecture
- Solving the parallel programming problem
  - Key technical challenge facing today's computing industry, government agencies, and scientists
- Scientific simulation discretizes some space into a grid
  - Perform local computations on the grid
  - Communicate partial results between grids
  - Repeat for a set of time steps
  - Possibly perform other calculations with results
- Commodity parallel programming can draw from this history and move forward in a new direction
- Writing fast parallel programs is difficult
  - Amdahl's Law: must parallelize most of the computation
  - Data locality
  - Communication and synchronization
  - Load imbalance

Next Time
- An exploration of parallel algorithms and their features
- First written homework assignment
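To give the flavor of the first model listed above, here is a minimal Pthreads sketch in C (an illustration added here, not taken from the lecture): each thread independently sums one chunk of an array, and the main thread joins the threads and combines their local results. The array contents, chunk sizes, and names are hypothetical; compile with a pthread-aware flag such as -pthread.

/* Minimal Pthreads example: each thread sums one chunk of an array in
 * parallel; the main thread joins them and combines the partial sums. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000                 /* chosen divisible by NTHREADS */

static double data[N];
static double partial[NTHREADS];

static void *sum_chunk(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = lo + (N / NTHREADS);
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;           /* each thread writes only its own slot */
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];

    for (long i = 0; i < N; i++)
        data[i] = 1.0;         /* so the expected total is N */

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, sum_chunk, (void *)t);

    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(threads[t], NULL);
        total += partial[t];   /* combine local results after each thread finishes */
    }

    printf("total = %f (expected %d)\n", total, N);
    return 0;
}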