CIT 668: System Architecture
Parallel Computing (2/14/2011)

Topics
1. What is Parallel Computing?
2. Why use Parallel Computing?
3. Types of Parallelism
4. Amdahl's Law
5. Flynn's Taxonomy of Parallel Computers
6. Parallel Memory Architectures
7. Parallel Programming Models

Images from the LLNL Parallel Computing Tutorial, Wikipedia, or Majd F. Sakr's parallel computation lectures unless otherwise noted.

Serial and Parallel Computation
(figures contrasting serial and parallel execution of a problem)

History of Parallel Computation

Parallel Computation
Breaking a problem into pieces, using multiple computer resources to solve each piece, and reassembling the solution pieces into the final answer.
Parallelism is limited by data dependencies.

Data Dependencies
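To make the data-dependency limit concrete, here is a small illustrative C program (the arrays and values are made up): the first loop's iterations are independent and could run in parallel, while the second loop's iterations each need the result of the previous one and therefore cannot.

    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        double b[N];

        /* Independent iterations: each b[i] uses only a[i], so the
           iterations could run in parallel. */
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * a[i];

        /* Dependent iterations: each a[i] needs the new value of a[i-1],
           so iteration i cannot start before iteration i-1 finishes. */
        for (int i = 1; i < N; i++)
            a[i] = a[i] + a[i - 1];

        printf("b[7]=%.1f a[7]=%.1f\n", b[7], a[7]);
        return 0;
    }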

Parallel Terminology
- Task: A logically discrete section of computational work. A task is typically a thread or process at the OS level.
- Communications: Parallel tasks typically need to exchange data over a shared memory bus or over a network.
- Synchronization: Coordination of parallel tasks in real time. Often implemented by establishing a synchronization point within an application where a task cannot proceed further until other tasks reach that point.
- Scalability: Ability of a parallel system to demonstrate a proportionate increase in speed with the addition of more processors.
- Embarrassingly Parallel: Solving many similar but independent tasks in parallel with little or no need for coordination between tasks.

Parallel Granularity
Fine grain
- Relatively small amounts of computation done between communication events.
- Low computation-to-communication ratio.
- Facilitates load balancing.
Coarse grain
- Relatively large amounts of computation done between communication events.
- High computation-to-communication ratio.
- Difficult to load balance.

Why Parallel Computing?

The Real World is Parallel

Modeling Science & Engineering Problems

Reasons for Parallel Computation
- Limits to serial computing: CPU clock speeds have increased only slowly since 2003.
- Solve problems faster: reduce time by using more resources.
- Solve larger problems: scientific problems, web-scale applications.

Types of Parallelism
- Bit-level parallelism
- Instruction-level parallelism
- Data-level parallelism
- Task-level parallelism

The Processor
- The brain: a functional unit that interprets and carries out instructions (mathematical operations).
- Also called a CPU (actually includes a control unit + ALU).
- Consists of hundreds of millions of transistors.

Processor Components: Control
- Control Unit: the processor's supervisor; runs the fetch/execute cycle (fetch, decode, execute, store).
- Program Counter (PC): stores the address of the next instruction to be fetched.
- Instruction Register (IR): holds the instruction most recently fetched.

Processor Components: Datapath
- Register File: general-purpose storage locations inside the processor that hold addresses or values. A 32-bit processor has registers that are typically 32 bits in size.
- Arithmetic Logic Unit (ALU): set of functional units that perform arithmetic and logic operations.

Bit-level Parallelism
Increase processor word size to operate on more bits at once.
Task: add two 64-bit numbers.
- 32-bit CPU: must complete two 32-bit operations plus handle the carry between them.
- 64-bit CPU: adds two 64-bit integers in a single instruction.
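A minimal C sketch of the 64-bit addition task above, under the assumption that only 32-bit operations are available; the helper name add64_via_32 and the test values are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    /* Add two 64-bit values using only 32-bit operations: two 32-bit adds
       plus explicit carry handling, as a 32-bit ALU would have to do. */
    static uint64_t add64_via_32(uint64_t x, uint64_t y) {
        uint32_t xlo = (uint32_t)x, xhi = (uint32_t)(x >> 32);
        uint32_t ylo = (uint32_t)y, yhi = (uint32_t)(y >> 32);

        uint32_t lo = xlo + ylo;              /* first 32-bit add  */
        uint32_t carry = (lo < xlo) ? 1 : 0;  /* detect carry out  */
        uint32_t hi = xhi + yhi + carry;      /* second 32-bit add */

        return ((uint64_t)hi << 32) | lo;
    }

    int main(void) {
        uint64_t x = 0xFFFFFFFFULL, y = 1;    /* chosen to force a carry */
        printf("%llu %llu\n",
               (unsigned long long)add64_via_32(x, y),
               (unsigned long long)(x + y));  /* single 64-bit add */
        return 0;
    }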

Evolution of Processor Word Size

Instruction-Level Parallelism (ILP)
Running independent instructions on separate execution units simultaneously.

    x = a + b
    y = c + d
    z = x + y

- Serial execution: if each instruction takes one cycle, the program takes 3 clock cycles.
- Parallel execution: the first two instructions are independent, so they can execute simultaneously. The third instruction depends on the first two, so it must execute afterwards. The program takes 2 clock cycles.

Instruction Pipelining
Improve ILP by splitting the processing of a single machine-language instruction into a series of independent steps. The CPU can issue instructions at the processing rate of the slowest step, increasing clock speed.
Pipeline stages: IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back.

Sequential Laundry
Four loads (A-D), each passing through four 30-minute stages, run back to back starting at 6 PM. Sequential laundry = 8 hours for 4 loads.
http://inst.eecs.berkeley.edu/~cs61c/fa10

Pipelined Laundry
The same four loads overlap: load B starts its first stage as soon as load A moves to its second. Pipelined laundry = 3.5 hours for 4 loads!
http://inst.eecs.berkeley.edu/~cs61c/fa10

Pipelining Lessons
- Pipelining doesn't decrease the latency of a single task; it increases the throughput of the entire workload.
- Multiple tasks operate simultaneously using different resources.
- Potential speedup = number of stages.
- Time to fill the pipeline and time to drain it reduce the speedup: 2.3X vs. 4X in this example.
- Speedup is limited by the slowest pipeline stage.
http://inst.eecs.berkeley.edu/~cs61c/fa10

Processor Pipeline Execution
Pipeline diagram: eight instructions flow through the five stages (IF, ID, ALU, MEM, WR) over twelve clock periods, with a new instruction entering the pipeline each cycle.
http://inst.eecs.berkeley.edu/~cs61c/fa10

Hazards
Problems that prevent pipelined execution.
- Data hazards: must wait for a previous instruction to compute data that the current instruction needs as input.
- Structural hazards: a processor component (execution unit or memory unit) is needed by two simultaneous instructions.
- Control hazards: conditional branch statements offer two alternatives for the next instruction to fetch.

Working Around Hazards
- Instruction re-ordering: re-order instructions to extract more ILP; data dependencies limit re-ordering.
- Branch prediction: predict which branch is likely to be taken, then execute those instructions. Loops usually stay in the loop and only exit once.
- Loop unrolling: a human or the compiler replicates the body of a loop to expose more parallelism (see the sketch below).
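The loop-unrolling sketch referenced above, in illustrative C (the array contents and the unroll factor of 4 are made up; in practice the compiler often does this automatically):

    #include <stdio.h>

    #define N 16

    int main(void) {
        double a[N], sum = 0.0, s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < N; i++) a[i] = i;

        /* Original loop: every iteration adds into the same accumulator,
           creating a chain of data dependencies. */
        for (int i = 0; i < N; i++)
            sum += a[i] * a[i];

        /* Unrolled by 4 with separate accumulators: the four multiply-adds
           in each iteration are independent and can issue in parallel.
           (Assumes N is a multiple of 4 for brevity.) */
        for (int i = 0; i < N; i += 4) {
            s0 += a[i]     * a[i];
            s1 += a[i + 1] * a[i + 1];
            s2 += a[i + 2] * a[i + 2];
            s3 += a[i + 3] * a[i + 3];
        }

        printf("%.1f %.1f\n", sum, s0 + s1 + s2 + s3);
        return 0;
    }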

Superscalar Pipelined Execution

Superscalar Execution
http://arstechnica.com/paedia/c/cpu/part-2/

AMD Athlon Architecture: Superscalar Pipeline

Pipeline Depth and Issue Width

CPU                  Year  Clock Speed  Pipeline Stages  Issue Width  Cores  Power
80486                1989  25 MHz       5                1            1      5W
Pentium              1993  66 MHz       5                2            1      10W
Pentium Pro          1997  150 MHz      10               3            1      29W
Pentium 4            2001  1500 MHz     22               3            1      75W
Pentium 4 Prescott   2004  3600 MHz     31               3            1      103W
Core 2 Conroe        2006  2930 MHz     14               4            2      75W
Core 2 Yorkfield     2008  2930 MHz     16               4            4      95W
Core i7 Gulftown     2010  3460 MHz     16               4            6      130W

Pipeline Depth and Issue Width
Chart (log scale, 1989-2010) plotting clock speed, power, pipeline stages, issue width, and core count over time.
http://inst.eecs.berkeley.edu/~cs61c/fa10

Data Parallelism
- Distribute data across different computing nodes, so the same operation is performed on different parts of the same data structure.
- AKA loop-level parallelism.
- If each loop iteration depends on results from the previous iteration, the loop cannot be parallelized (see the sketch below).
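As a sketch of the parallelizable case, here is a loop whose iterations are independent, expressed with OpenMP (one common way to write loop-level parallelism; the slides do not prescribe a particular tool, and the array and sizes are made up). Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp; without OpenMP the pragma is ignored and the loop simply runs serially.

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];

        /* Each iteration touches only its own element, so the iterations
           carry no dependencies and can be split across cores. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        printf("a[N-1] = %.1f\n", a[N - 1]);
        return 0;
    }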

Task Parallelism
- Different operations on different data sets.
- Each processor performs a different task.
- Each processor communicates with other processes to get inputs and deliver results.

Multiple CPUs

Multicore
- Multicore CPU chips contain multiple complete processors.
- Individual L1 and shared L2 caches.
- The OS and applications see each core as an independent processor (see the sketch below).
- Each core can run a separate task.
- A single application must be divided into multiple tasks to improve performance.
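A tiny illustration of the point about the OS view, assuming Linux/glibc: the OS reports each core (and each hardware thread) as an independent logical processor. _SC_NPROCESSORS_ONLN is a common extension rather than strict POSIX.

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Number of logical processors currently online, as seen by the OS. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        printf("OS sees %ld logical processors\n", n);
        return 0;
    }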

Multicore Organization Alternatives

Core 2 Duo and Core i7 Architectures

High-Core-Count CPUs Use an On-Chip Network
- Low-core-count CPUs use ~1000 wires per core to connect to the L3 cache and to each other.
- To scale, the Core i7 uses a ring bus to communicate between cores and L3 cache slices.

Core i7 2600K Ring Bus Architecture

Multicore vs. Multiprocessor

                                Multicore   Multiprocessor
Inter-processor communication   Faster      Slower
Memory bandwidth per CPU        Lower       Higher
Power consumption per CPU       Lower       Higher
Cost per CPU                    Lower       Higher

Simultaneous Multi-Threading
- The CPU presents virtual cores to the OS.
- The CPU duplicates the PC, IR, and thread state registers, but keeps the same number of execution units.
- The OS feeds two threads at once to the CPU.
- Improves ILP by having multiple instruction streams that are unlikely to have cross-stream dependencies.

CPU Parallelism Architectures Compared
- Parallelism types are combined to increase parallel capacity.
- x86 cores have been superscalar since the Pentium in 1993.
- Server x86 CPUs have used simultaneous multithreading since the Pentium 4 Xeon in 2000.

Flynn's Taxonomy

                Single Instruction          Multiple Instruction
Single Data     SISD: Pentium III           MISD: none today
Multiple Data   SIMD: SSE instruction set   MIMD: Xeon E5345 (Clovertown)

Single Instruction Single Data (SISD)
- Serial computation.
- Single instruction: only one instruction stream is acted on by the CPU during one clock cycle.
- Single data: only one data stream is used as input during any clock cycle.

Single Instruction Multiple Data (SIMD)
- Parallel computation.
- Single instruction: all processing units are given the same instruction at any clock cycle.
- Multiple data: each processing unit can operate on a different data element.
- Applications: graphics. Example: Radeon R770 GPU. (An SSE sketch follows after this list.)

Multiple Instruction Single Data (MISD)
- Parallel computation.
- Single data: a single data stream is fed into multiple processing units.
- Multiple instruction: each processing unit operates on the data independently via its own instruction stream.

Multiple Instruction Multiple Data (MIMD)
- Parallel computation.
- Multiple instruction: each processing unit may execute a different instruction stream.
- Multiple data: each processing unit may work with a different data stream.
- Examples: multicore, grid computing.
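The SSE sketch referenced under SIMD above: a single _mm_add_ps intrinsic adds four floats at once (one instruction, multiple data elements). The array values are made up; this assumes an x86 compiler with SSE support.

    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load 4 floats                 */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* 4 additions, 1 instruction    */
        _mm_storeu_ps(c, vc);             /* store 4 results               */

        printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }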

Taxonomy of Parallel Architectures

Amdahl's Law

Scaling Parallel Computing
Just add more processors?
- Hardware limitations: memory-CPU bus bandwidth on the local machine; network bandwidth and latency.
- Software limitations: most algorithms have limits to scalability; supporting libraries may have their own limits.

Amdahl's Law
Speedup due to an enhancement E:

    Speedup = (execution time without E) / (execution time with E)

Suppose E accelerates a piece P (P < 1) of the task by a factor S (S > 1) and leaves the remainder unaffected. Then:

    Exec time with E = Exec time without E × [ (1 - P) + P/S ]
    Speedup = 1 / [ (1 - P) + P/S ]

Amdahl's Law: Example
Consider an application whose work is divided into the following four components:

    Work load   Memory Access  Computation  Disk Access  Network Access
    Time        10%            70%          10%          10%

What is the expected percent improvement if:
- Memory access speed is doubled?  5%
- Computation speed is doubled?    35%

Amdahl's Law for Parallelization

    Speedup = 1 / [ (1 - P) + P/S ]

Let P be the parallelizable portion of the code. As the number of processors increases, the time to do the parallel portion of the program, P/S, tends toward zero, reducing the equation to:

    Speedup = 1 / (1 - P)

- If P = 0, then speedup = 1 (no improvement).
- If P = 1, then speedup grows without limit.
- If P = 0.5, then the maximum speedup is 2.

(A short sketch of this arithmetic follows below.)
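A small C sketch of the arithmetic above; it reads "percent improvement" as the reduction in total execution time, which reproduces the 5% and 35% answers from the example (the helper name remaining_time is illustrative):

    #include <stdio.h>

    /* Fraction of the original execution time remaining after a portion p
       of the work is sped up by a factor s: (1 - p) + p/s. */
    static double remaining_time(double p, double s) {
        return (1.0 - p) + p / s;
    }

    int main(void) {
        /* Memory access is 10% of the work; double its speed. */
        printf("memory x2:      %.0f%% faster overall\n",
               100.0 * (1.0 - remaining_time(0.10, 2.0)));
        /* Computation is 70% of the work; double its speed. */
        printf("computation x2: %.0f%% faster overall\n",
               100.0 * (1.0 - remaining_time(0.70, 2.0)));
        /* Speedup form of the same result: 1 / ((1 - P) + P/S). */
        printf("speedup (P=0.70, S=2): %.2fx\n",
               1.0 / remaining_time(0.70, 2.0));
        return 0;
    }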

Amdahl's Law: Parallel Example
Consider an application whose work is divided into the following four functions:

    Work load   f1   f2    f3    f4
    Time        4%   10%   80%   6%

Assume f1, f3, and f4 can be parallelized, but f2 must be computed serially.
- Parallelizing which part would best improve performance?  f3
- What is the best speedup that could be reached by parallelizing all three parallelizable functions?  10X

Amdahl's Law: Time Example
Consider an application whose work is divided into the following four functions:

    Work load   f1     f2     f3     f4
    Time        2ms    5ms    40ms   3ms

Assume f1, f3, and f4 can be parallelized, but f2 must be computed serially, and that running the whole program takes 50ms.
- What is the best running time that can be achieved by parallelizing f1, f3, and f4?  5ms
- Why can't parallelizing the program decrease the total running time below that? 5ms is the time required for the serial part; even if the parallel part takes 0ms, f2 still takes 5ms to run.

Amdahl's Law

Parallel Memory Architectures
- Uniform Memory Architecture
- Non-Uniform Memory Architecture
- Distributed Memory
- Hybrid Distributed-Shared Memory

Uniform Memory Architecture (UMA)
- Global shared address space for all memory.
- Symmetric Multi-Processing (SMP).
- Does not scale to much more than 8 CPUs due to memory contention.

Non-Uniform Memory Architecture (NUMA)
- Global shared address space where processors have fast access to nearby memory; memory access across links is slower.
- Better scaling than UMA, but still limited by memory contention.

Distributed Memory
- Each CPU has its own local address space; changes made by one CPU are not visible to other CPUs.
- CPUs must use the network to exchange data.
- Highly scalable: CPUs have fast local RAM.
- Data communication is the programmer's responsibility.

Hybrid Distributed-Shared Memory
- The shared memory components are SMP nodes.
- The distributed memory component is the network of SMP nodes.
- Used by most large supercomputers and clouds.

Hybrid Distributed-Shared in the Cloud

Parallel Programming Models
- Shared Memory Model
- Threads Model
- Data Parallel Model
- Message Passing Model

Parallel Programming Models
- An abstraction above the hardware level for programmers to use.
- Any programming model can be used with any parallel architecture, though model performance may depend on the architecture.
- Model choice depends on the problem being solved and programmer preference.

Shared Memory
- Tasks share a common global address space, which they read and write asynchronously.
- Software mechanisms such as locks and semaphores are used to control access to shared memory.

Shared Memory Model
Advantages
- Program development is simplified since the process owns the data stored in memory.
- Referencing data in shared memory is similar to traditional serial programming.
Disadvantages
- Difficult to understand and manage data locality. Keeping data local to the CPU working on it is faster, but bus traffic results when other CPUs try to access that data.

Threads
- Divide a single program into multiple concurrent execution paths called threads.
- Threads are distributed across CPUs and can be executed on multiple CPUs simultaneously (see the sketch below).

Threads
- Each thread has local data structures specific to that thread, but all threads share the common process global memory space.
- Threads are associated with shared memory architectures.
- Threads can be scheduled by the OS or by middleware such as the language VM (green threads). Green threads start and synchronize faster, but most implementations cannot use multiple CPUs.
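A minimal sketch of the threads model using POSIX threads (one common implementation; the slides do not name a specific API). All threads share the process's global array and each thread writes only its own chunk. The array size and thread count are made up; compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define N        8
    #define NTHREADS 2

    static double a[N];                 /* shared process global memory */

    static void *worker(void *arg) {
        long id = (long)arg;
        int chunk = N / NTHREADS;
        for (int i = id * chunk; i < (id + 1) * chunk; i++)
            a[i] = 2.0 * i;             /* each thread writes only its chunk */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        for (int i = 0; i < N; i++)
            printf("%.1f ", a[i]);
        printf("\n");
        return 0;
    }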

Data Parallel
- The data set is divided into chunks and operations are performed on each chunk concurrently.
- Tasks working on the same data structure perform the same operation on each instance of the data (ex: multiply each array element by X).

Message Passing
- Tasks use only their own local memory. Tasks can be on multiple machines.
- Data is exchanged between tasks by sending and receiving messages (see the sketch below).
- Data transfer requires cooperation between tasks.

Parallel Code Example
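A minimal message-passing sketch using MPI (a common message-passing library; the slides do not prescribe one). Rank 0 sends a value and rank 1 receives it, showing that both sides must cooperate in the transfer. The value and tag are made up; build and run with an MPI implementation, e.g. mpicc and mpirun -np 2.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send to rank 1 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* receive from rank 0 */
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }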

Example: Array Processing

    if MASTER
        initialize array
        send each WORKER info on the chunk it owns
        send each WORKER its chunk of the initial array
        recv results from WORKERs
    elsif WORKER
        recv info on my chunk
        recv my chunk of array data
        do j = 1st col .. last col
            do i = 1, n
                a(i,j) = f(i,j)
            end
        end
        send results to MASTER
    end

The array is divided into chunks; each CPU owns a chunk and executes the portion of the loop corresponding to it.

Example: Heat Equation
The heat equation describes the change in temperature in a region over time, given an initial temperature distribution and boundary conditions. Divide the region into chunks and iterate, allowing heat from nearby chunks to change the temperature of a chunk in the next iteration.
To compute U(x,y): (update formula shown in the slide figure)

Example: Heat Equation
Serial program: (shown in the slide figure; a sketch follows below)
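The serial program itself appears only as a figure in the slides; the following is a hedged C sketch assuming the usual explicit finite-difference update, with made-up grid size, coefficients, and step count:

    #include <stdio.h>

    #define NX 100
    #define NY 100
    #define STEPS 500

    static double u1[NX][NY], u2[NX][NY];

    int main(void) {
        double cx = 0.1, cy = 0.1;          /* assumed diffusion coefficients */

        u1[NX / 2][NY / 2] = 100.0;         /* simple initial hot spot */

        for (int step = 0; step < STEPS; step++) {
            /* Update every interior point from its four neighbors. */
            for (int i = 1; i < NX - 1; i++)
                for (int j = 1; j < NY - 1; j++)
                    u2[i][j] = u1[i][j]
                             + cx * (u1[i + 1][j] + u1[i - 1][j] - 2.0 * u1[i][j])
                             + cy * (u1[i][j + 1] + u1[i][j - 1] - 2.0 * u1[i][j]);
            /* Copy new values back for the next iteration (boundaries stay fixed). */
            for (int i = 1; i < NX - 1; i++)
                for (int j = 1; j < NY - 1; j++)
                    u1[i][j] = u2[i][j];
        }

        printf("center temperature after %d steps: %f\n", STEPS, u1[NX / 2][NY / 2]);
        return 0;
    }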

Example: Parallel Heat Equation Solver
- Divide the array into chunks.
- Data dependencies: interior elements are independent of other tasks; border elements depend on other tasks, so tasks must communicate.
- The master sends the initial state to worker tasks and collects results from the workers.
- Workers compute the heat equation, communicating the state of border elements with adjacent worker tasks.

Parallel Code

Key Points
- Granularity refers to the ratio of computation to communication.
- Levels of parallelism:
  - Bit-level parallelism
  - Data-level parallelism (loop-level)
  - Instruction-level parallelism: pipelining, superscalar
  - Task-level parallelism: multi-CPU, multi-core, hyperthreading
- Flynn's Taxonomy: SIMD, MIMD

Key Points
- Amdahl's Law: no matter how many processors, speedup is limited by the sequential portion of the program.
- Parallel Memory Architectures: shared memory for SMP (UMA, NUMA); distributed memory for clusters or large-scale multiprocessing; most clouds use a hybrid shared-distributed architecture.
- Parallel Programming Models: Shared Memory, Threads, Data Parallel, Message Passing.