
CSC 447: Parallel Programming for Multi-Core and Cluster Systems
Introduction and Administrivia
Haidar M. Harmanani
Spring 2018

Course Introduction
- Lectures: TTh, 11:00-12:15, from January 16, 2018 until 25, 2018
- Prerequisites:
  o Know how to program
  o Data Structures
  o Computer Architecture would be helpful but not required

Grading and Class Policies
- Grading:
  o Midterm: 25%
  o Final: 40%
  o Programming Assignments [4-6]: 25%
  o Course Project: 10%
- Exam details:
  o Exams are closed book, closed notes
  o All assignments must be your own original work; cheating/copying/partnering will not be tolerated

Peer Instruction
- Increase real-time learning in lecture; test understanding of concepts vs. details
- We will use multiple-choice questions to initiate discussions and reinforce learning:
  o 1-2 minutes to decide yourself
  o 2 minutes in pairs/triples to reach consensus. Teach others!
  o 2-minute discussion of answers, questions, clarifications

Lab Assignments
- All assignments and handouts will be communicated via Piazza
  o Make sure you enable your account
  o Use Piazza for questions and inquiries; no questions will be answered via email
- All assignments must be submitted via GitHub
  o git is a distributed version control system
  o Version control systems are better tools for sharing code than emailing files, using flash drives, or Dropbox
  o Make sure you get a private repo; apply for a free account: https://education.github.com/discount_requests/new

Late Policy
- Assignments are due at 5:00 pm on the due date
- Every day your project or homework is late (even by a day), a 15-point penalty is assessed
- No credit if more than 3 days late
- I will be checking on your progress using your GitHub accounts
- No break for those who procrastinate!

Policy on Assignments and Independent Work
- With the exception of laboratories and assignments (projects and HW) that explicitly permit you to work in groups, all homework and projects are to be YOUR work and your work ALONE.
- It is NOT acceptable to copy solutions from other students.
- It is NOT acceptable to copy (or start your) solutions from the Web.
- PARTNER TEAMS MAY NOT WORK WITH OTHER PARTNER TEAMS.
- You are encouraged to help teach each other to debug. Beyond that, we don't want you sharing approaches, ideas, code, or whiteboards with other students, since sometimes the point of the assignment is the algorithm, and if you share that, they won't learn what we want them to learn. We expect that what you hand in is yours.
- It is NOT acceptable to leave your code anywhere an unscrupulous student could find and steal it (e.g., public GitHub repos, walking away while leaving yourself logged on, leaving printouts lying around, etc.).
- The first offense earns a zero on the assignment; the second earns an F in the course.
- Both giver and receiver are equally culpable and suffer equal penalties.

Architecture of a Typical Lecture
[Figure: attention level over a 75-minute lecture: administrivia up front, full attention for the core material, and "And in conclusion..." at the end]

Contact Information
- Haidar M. Harmanani
- Office: Block A, 810
- Hours: TTh 8:00-9:30 or by appointment
- Email: haidar@lau.edu.lb

Course Materials
[Slide shows four textbook covers: one marked Required, three marked Optional]

Computers/Programming
- You will be using:
  o The lab for MPI, OpenMP, and GPU programming
  o The lab or your own machines for Pthreads and OpenMP
  o The Computer Science lab for CUDA programming, or your own machine if you have a CUDA-capable GPU
  o The cloud for GPU tutorial labs; access will be provided in due time

Administrative Questions?

Definitions
- Parallel computing: using a parallel computer to solve single problems faster
- Parallel computer: a computer with multiple processors that supports parallel programming
- Parallel programming: programming in a language that supports concurrency explicitly (see the sketch below)

Parallel Processing: What Is It?
- A parallel computer is a computer system that uses multiple processing elements simultaneously, in a cooperative manner, to solve a computational problem
- Parallel processing includes the techniques and technologies that make it possible to compute in parallel:
  o Hardware, networks, operating systems, parallel libraries, languages, compilers, algorithms, tools, ...
- Parallel computing is an evolution of serial computing
  o Parallelism is natural
  o Computing problems differ in the level/type of parallelism
- Parallelism is all about performance! Really?
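As a concrete illustration of "programming in a language that supports concurrency explicitly," here is a minimal C/OpenMP sketch (an assumed-typical first example, not taken from the course handouts):

    #include <stdio.h>
    #include <omp.h>   /* OpenMP runtime API */

    int main(void) {
        /* Each thread in the team executes this block concurrently. */
        #pragma omp parallel
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }

Compiled with, e.g., gcc -fopenmp hello.c, it spawns one thread per available core by default; the explicit #pragma is what makes the concurrency visible to the programmer.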

Concurrency
- Consider multiple tasks to be executed in a computer
- Tasks are concurrent with respect to each other if:
  o They can execute at the same time (concurrent execution)
  o This implies that there are no dependencies between the tasks
- Dependencies:
  o If a task requires results produced by other tasks in order to execute correctly, the task's execution is dependent
  o If two tasks are dependent, they are not concurrent
  o Some form of synchronization must be used to enforce (satisfy) dependencies (see the sketch below)
- Concurrency is fundamental to computer science
  o Operating systems, databases, networking, ...

Concurrency and Parallelism
- Concurrent is not the same as parallel! Why?
- Parallel execution:
  o Concurrent tasks actually execute at the same time
  o Multiple (processing) resources have to be available
- Parallelism = concurrency + parallel hardware
  o Both are required
  o Find concurrent execution opportunities
  o Develop the application to execute in parallel
  o Run the application on parallel hardware
- Is a parallel application a concurrent application?
- Is a parallel application run with one processor parallel? Why or why not?
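To make the dependency and synchronization point concrete, here is a small C/OpenMP sketch (my illustration, not from the slides): task B needs a value produced by task A, so the two tasks are dependent rather than concurrent, and the depend clauses provide the synchronization that enforces the ordering.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int data = 0;

        #pragma omp parallel
        #pragma omp single        /* one thread creates both tasks */
        {
            /* Task A produces 'data'. */
            #pragma omp task shared(data) depend(out: data)
            data = 42;

            /* Task B consumes 'data'; depend(in: data) forces it to
               run only after task A completes. */
            #pragma omp task shared(data) depend(in: data)
            printf("consumer read %d\n", data);
        }
        return 0;
    }

Without the depend clauses, the two tasks would be free to run in either order (or simultaneously) and would race on data.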

Why Parallel Computing Now?
- Researchers have been using parallel computing for decades:
  o Mostly in computational science and engineering
  o Problems too large to solve on one computer; use 100s or 1000s
[Figure: classical science loops between nature, observation, theory, and physical experimentation; the modern scientific method adds numerical simulation]

Why Parallel Computing Now? (cont.)
- Many companies in the 80s/90s bet on parallel computing and failed:
  o Computers got faster too quickly for there to be a large market
- Why an undergraduate course on parallel programming?
  o Because the entire computing industry has bet on parallelism
  o There is a desperate need for parallel programmers
- There are 3 ways to improve performance, each with a computer analogy:
  o Work harder: use faster hardware
  o Work smarter: use optimized algorithms and techniques to solve computational tasks
  o Get help: use multiple computers to solve a particular task

Technology Trends: Microprocessor Capacity
- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

Technology Trends: Microprocessor Capacity (cont.)
[Figure: speed (log scale) vs. time for supercomputers, mainframes, minis, and micros]
- Microprocessors have become smaller, denser, and more powerful.

50 Years of Speed Increases
- ENIAC: 350 flops
- Today: > 1 trillion flops

CPUs: 1 Million Times Faster
- Faster clock speeds
- Greater system concurrency:
  o Multiple functional units
  o Concurrent instruction execution
  o Speculative instruction execution

Systems: 1 Billion Times Faster
- Processors are 1 million times faster
- Combine thousands of processors
- Parallel computer:
  o Multiple processors
  o Supports parallel programming
- Parallel computing = using a parallel computer to execute a program faster

Microprocessor Transistors and Clock Rate
[Figure: two log-scale plots vs. year, 1970-2005: transistor counts rising from the i4004 through the i8080, i8086, i80286, i80386, R2000, R3000, R10000, and Pentium; clock rates rising from under 1 MHz toward 1000 MHz]
- Why bother with parallel programming? Just wait a year or two...

Limit #1: Power Density
- Scaling clock speed (business as usual) will not work
[Figure: power density (W/cm²), log scale, vs. year, 1970-2010, from the 4004 through the 486, Pentium, and P6, with reference lines for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface]
- Intel VP Patrick Gelsinger (ISSCC 2001): "If scaling continues at present pace, by 2005, high-speed processors would have the power density of a nuclear reactor; by 2010, a rocket nozzle; and by 2015, the surface of the sun."
- Device scaling (scaled down by L): doping increased by a factor of S
  o Increasing the channel doping density decreases the depletion width
  o => improves isolation between source and drain in the OFF state
  o => permits the distance between the source and drain regions to be scaled
  o (for a very idealistic NMOS transistor)

Parallelism Saves Power
- Exploit explicit parallelism to reduce power:
  o Power = C × V² × F (capacitance × voltage² × frequency)
  o Performance = Cores × F
- Using additional cores:
  o Increased density = more transistors = more capacitance
  o Can increase cores (2x) and performance (2x)
  o Or increase cores (2x) but decrease frequency (1/2): same performance at 1/4 the power (worked out below)
- Additional benefits:
  o Small/simple cores => more predictable performance

Limit #2: Hidden Parallelism Tapped Out
- Application performance was increasing by 52% per year, as measured by the SPECint benchmarks
[Figure: SPECint performance vs. year (from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006): VAX, 25%/year from 1978 to 1986; RISC + x86, 52%/year from 1986 to 2002; ??%/year thereafter]
- Roughly 1/2 of the 52%/year was due to transistor density, 1/2 due to architecture changes, e.g., Instruction-Level Parallelism (ILP)
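A worked version of the power arithmetic from the "Parallelism Saves Power" slide, using the standard CMOS dynamic-power model (the voltage-halving step assumes supply voltage can scale down with frequency, as the slide's numbers imply):

    % Dynamic power and throughput:
    P = C V^2 F, \qquad \text{Performance} \propto \text{Cores} \times F
    % Doubling the cores doubles the switched capacitance; halving the
    % frequency permits roughly halving the supply voltage:
    P_{\text{new}} = (2C)\left(\frac{V}{2}\right)^{2}\frac{F}{2} = \frac{1}{4}\,C V^2 F
    % Performance is unchanged: 2 \times \frac{F}{2} = F.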

Limit #2: Hidden Parallelism Tapped Out (cont.)
- Superscalar (SS) designs were the state of the art; many forms of parallelism were not visible to the programmer:
  o Multiple instruction issue
  o Dynamic scheduling: hardware discovers parallelism between instructions
  o Speculative execution: look past predicted branches
  o Non-blocking caches: multiple outstanding memory operations
- You may have heard of these in CSC320, but you haven't needed to know about them to write software
- Unfortunately, these sources have been used up (a small illustration follows below)

Uniprocessor Performance (SPECint)
[Figure: performance relative to the VAX-11/780, 1978-2006 (from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006): VAX, 25%/year from 1978 to 1986; RISC + x86, 52%/year from 1986 to 2002; ??%/year (perhaps 2x every 5 years?) from 2002 to present, already a 3x shortfall against the old trend]
- => Sea change in chip design: multiple cores or processors per chip
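A small C sketch (my illustration, not from the slides) of the kind of parallelism superscalar hardware discovers on its own: in the first function the three multiplies are mutually independent, so a multiple-issue core can execute them in the same cycle; in the second, each multiply consumes the previous result, leaving the hardware nothing to overlap.

    /* Independent operations: a superscalar core can issue all three
       multiplies at once, since none consumes another's result. */
    double independent(double x0, double x1, double x2,
                       double y0, double y1, double y2) {
        double a = x0 * y0;
        double b = x1 * y1;
        double c = x2 * y2;
        return a + b + c;
    }

    /* A dependence chain: each multiply must wait for the previous one,
       so the instructions execute serially no matter how wide the core. */
    double dependent(double x, double y0, double y1, double y2) {
        double d = x * y0;
        d = d * y1;
        d = d * y2;
        return d;
    }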

Limit #3: Chip Yield
- Manufacturing costs and yield problems limit the use of density:
  o Moore's (Rock's) 2nd law: fabrication costs go up
  o Yield (% usable chips) drops
- Parallelism can help:
  o More smaller, simpler processors are easier to design and validate
  o Can use partially working chips, e.g., the Cell processor (PS3) is sold with 7 out of 8 cores enabled to improve yield

Limit #4: Speed of Light (Fundamental)
- Consider a 1 Tflop/s, 1 Tbyte sequential machine:
  o Data must travel some distance, r, to get from memory to the CPU
  o To get 1 data element per cycle, data must make the trip 10^12 times per second; at the speed of light, c = 3×10^8 m/s, this requires r < c/10^12 = 0.3 mm
  o Now put 1 Tbyte of storage in a 0.3 mm × 0.3 mm area: each bit occupies about 1 square Angstrom, the size of a small atom
- No choice but parallelism
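The slide's bound, written out (the arithmetic is standard; the TByte-to-bits rounding is my approximation):

    % A round-trip rate of 10^{12} per second at light speed bounds r:
    r < \frac{c}{10^{12}} = \frac{3\times 10^{8}\ \text{m/s}}{10^{12}\ \text{s}^{-1}}
      = 3\times 10^{-4}\ \text{m} = 0.3\ \text{mm}
    % 1 TByte \approx 10^{13} bits packed into a (0.3 mm)^2 area:
    \frac{(3\times 10^{-4}\ \text{m})^{2}}{10^{13}} = 9\times 10^{-21}\ \text{m}^2
      \approx 1\ \text{\AA}^2 \ \text{per bit}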

Revolution is Happening Now
- Chip density is continuing to increase, ~2x every 2 years
  o Clock speed is not
  o The number of processor cores may double instead
- There is little or no hidden parallelism (ILP) left to be found
- Parallelism must be exposed to, and managed by, software

Tunnel Vision by Experts
- "On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it."
  o Ken Kennedy, CRPC Director, 1994
- "640K [of memory] ought to be enough for anybody."
  o Bill Gates, chairman of Microsoft, 1981
- "There is no reason for any individual to have a computer in their home."
  o Ken Olson, president and founder of Digital Equipment Corporation, 1977
- "I think there is a world market for maybe five computers."
  o Thomas Watson, chairman of IBM, 1943

Why Parallelism (2018)?
- All major processor vendors are producing multicore chips
  o Every machine will soon be a parallel machine
  o All programmers will be parallel programmers???
- New software model:
  o Want a new feature? Hide the cost by speeding up the code first
  o All programmers will be performance programmers???
- Some parallelism may eventually be hidden in libraries, compilers, and high-level languages, but a lot of work is needed to get there
- Big open questions:
  o What will be the killer apps for multicore machines?
  o How should the chips be designed, and how will they be programmed?

Phases of Supercomputing (Parallel) Architecture
- Phase 1 (1950s): sequential instruction execution
- Phase 2 (1960s): sequential instruction issue
  o Pipelined execution, reservation stations
  o Instruction-Level Parallelism (ILP)
- Phase 3 (1970s): vector processors
  o Pipelined arithmetic units
  o Registers, multi-bank (parallel) memory systems
- Phase 4 (1980s): SIMD and SMPs
- Phase 5 (1990s): MPPs and clusters
  o Communicating sequential processors
- Phase 6 (>2000): many cores, accelerators, scale, ...
