Lecture #30 Parallel Computing


inst.eecs.berkeley.edu/~cs61c  CS61C : Machine Structures
Lecture #30 Parallel Computing, 2007-08-15
Scott Beamer, Instructor
In the news: ion wind cooling developed by researchers at Purdue (www.bbc.co.uk)

Review of Software Parallelism
- Parallelism is necessary: it looks like the future of computing, and it is unlikely that serial computing will ever catch up with parallel computing.
- Software parallelism: grids and clusters, networked computers. Two common ways to program them: Message Passing Interface (lower level) and MapReduce (higher level, more constrained).
- Parallelism is often difficult: speedup is limited by the serial portion of the code and by communication overhead.

A New Hope: Google's MapReduce
- Remember CS61A? (reduce + (map square '(1 2 3))) => (reduce + '(1 4 9)) => 14
- We told you the beauty of pure functional programming is that it is easily parallelizable. Do you see how you could parallelize this? Would it help if the reduce function argument were associative?
- Imagine 10,000 machines ready to help you compute anything you could cast as a MapReduce problem! This is the abstraction Google is famous for authoring (though their reduce is not the same as CS61A's or MPI's reduce; it builds a reverse-lookup table).
- It hides LOTS of the difficulty of writing parallel code! The system takes care of load balancing, dead machines, etc.

MapReduce Programming Model
- Input and output: each a set of key/value pairs.
- The programmer specifies two functions:
  map(in_key, in_value) -> list(out_key, intermediate_value)
    Processes an input key/value pair; produces a set of intermediate pairs.
  reduce(out_key, list(intermediate_value)) -> list(out_value)
    Combines all intermediate values for a particular key; produces a set of merged output values (usually just one).
- code.google.com/edu/parallel/mapreduce-tutorial.html

MapReduce Code Example
  map(String input_key, String input_value):
    // input_key : document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key : a word
    // output_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));

MapReduce Example Diagram
- Mapper nodes are responsible for the map function; reducer nodes are responsible for the reduce function. The data lives on a distributed file system (DFS).
- [Diagram: seven input files containing the words "ah ah er ah if or or uh or ah if" are mapped to intermediate (word, 1) pairs, which are grouped by key and reduced to the final counts ah:4, er:1, if:2, or:3, uh:1.]
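To make the word-count example concrete, here is a small serial C sketch of the same computation. It is only an illustration: the exact split of the slide's eleven words across seven input files is an assumption, and a real MapReduce system would run the map calls on many machines and group the intermediate (word, 1) pairs by key before calling reduce.

  /* Serial toy version of the slide's word-count MapReduce.
     The seven-file input split is an assumption; only the final counts
     (ah:4 er:1 if:2 or:3 uh:1) come from the slide. */
  #include <stdio.h>
  #include <string.h>

  #define MAX_PAIRS 64

  static char inter_key[MAX_PAIRS][16];   /* intermediate (word, "1") keys */
  static int  n_inter = 0;

  /* "map": for each word w in the document, EmitIntermediate(w, "1") */
  static void map_doc(const char *contents) {
      char buf[128];
      strncpy(buf, contents, sizeof buf - 1);
      buf[sizeof buf - 1] = '\0';
      for (char *w = strtok(buf, " "); w != NULL; w = strtok(NULL, " "))
          strcpy(inter_key[n_inter++], w);
  }

  /* "reduce": for each distinct key, sum its "1"s (grouping done inline) */
  static void reduce_all(void) {
      int done[MAX_PAIRS] = {0};
      for (int i = 0; i < n_inter; i++) {
          if (done[i]) continue;
          int count = 0;
          for (int j = i; j < n_inter; j++)
              if (strcmp(inter_key[i], inter_key[j]) == 0) { count++; done[j] = 1; }
          printf("%s: %d\n", inter_key[i], count);
      }
  }

  int main(void) {
      const char *files[7] = { "ah ah", "er", "ah if", "or",
                               "or uh", "or", "ah if" };
      for (int i = 0; i < 7; i++)
          map_doc(files[i]);   /* the mappers' work */
      reduce_all();            /* the reducers' merged output */
      return 0;
  }

Running it prints ah: 4, er: 1, if: 2, or: 3, uh: 1, matching the diagram's final counts.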

MapReduce Advantages/Disadvantages
- Now it's easy to program for many CPUs: communication management is effectively gone, I/O scheduling is done for us, and fault tolerance (monitoring machine failures, suddenly-slow machines, and other issues) is handled. It can be much easier to design and program!
- But it further restricts the solvable problems: some problems might be hard to express in a MapReduce framework. Data parallelism is key; you need to be able to break a problem up by data chunks.
- MapReduce itself is closed-source; Hadoop is the open-source alternative!

Introduction to Hardware Parallelism
- Given many threads (somehow generated by software), how do we implement this in hardware?
- Recall the performance equation: Execution Time = (Instruction Count)(CPI)(Cycle Time)
- Hardware parallelism improves:
  Instruction Count - if the equation is applied to each CPU, each CPU needs to do less
  CPI - if the equation is applied to the system as a whole, more is done per cycle
  Cycle Time - will probably be made worse in the process

Disclaimers
- Please don't let today's material confuse what you have already learned about CPUs and pipelining.
- When "programmer" is mentioned today, it means whoever is generating the assembly code (so it is probably a compiler).
- Many of the concepts described today are difficult to implement, so if it sounds easy, think of possible hazards.

Flynn's Taxonomy
- Classifications of parallelism types, by instruction stream and data stream (www.wikipedia.org):
  SISD - Single Instruction, Single Data
  SIMD - Single Instruction, Multiple Data
  MISD - Multiple Instruction, Single Data
  MIMD - Multiple Instruction, Multiple Data

Superscalar
- Add more functional units or pipelines to the CPU.
- Directly reduces CPI by doing more per cycle.
- Consider what happens if we: added another ALU, added 2 more read ports to the register file, and added 1 more write port to the register file.

Simple Superscalar MIPS CPU
- [Datapath diagram: PC/next-address logic and instruction memory fetch two instructions (Inst0 and Inst1) per cycle; the register file has read/write port groups W0/Ra/Rb and W1/Rc/Rd feeding two execution paths (A/B and C/D) and the data memory.]
- Can now do 2 instructions in 1 cycle!
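The performance equation above can be made concrete with a tiny calculation. The numbers below are assumptions chosen only for illustration (they are not from the lecture): a 2-issue superscalar that manages to lower CPI from 1.0 to 0.6 on the same program.

  /* Toy numbers (assumed) plugged into:
     Execution Time = Instruction Count x CPI x Cycle Time */
  #include <stdio.h>

  int main(void) {
      double inst_count = 1e9;      /* 1 billion instructions (assumption) */
      double cycle_time = 1e-9;     /* 1 ns clock, i.e. 1 GHz (assumption) */

      double cpi_scalar      = 1.0; /* one instruction completed per cycle */
      double cpi_superscalar = 0.6; /* 2-issue pipe that finds some ILP    */

      printf("scalar:      %.3f s\n", inst_count * cpi_scalar      * cycle_time);
      printf("superscalar: %.3f s\n", inst_count * cpi_superscalar * cycle_time);
      /* The win depends entirely on how often independent instructions
         can fill the second issue slot (program + compiler/programmer). */
      return 0;
  }

With these assumed numbers the superscalar run takes 0.6 s instead of 1.0 s; if no independent instructions can be found, CPI stays near 1.0 and the extra hardware buys nothing.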

Simple Superscalar MIPS CPU (continued)
- Considerations: the ISA now has to be changed, and forwarding for pipelining is now harder.
- Limitations: the programmer must explicitly generate parallel code; there is an improvement only if other instructions can fill the slots; and it doesn't scale well.

Single Instruction Multiple Data (SIMD)
- Often done in a vector form, so the same operation is applied to all of the data.
- Example: AltiVec (like SSE). Its 128-bit registers can hold 4 floats, 4 ints, 8 shorts, 16 chars, etc., and a whole vector is processed at once.

Superscalar in Practice
- ISAs have extensions for these vector operations.
- One thread, but that thread has parallelism internally.
- The performance improvement depends on the program and the programmer being able to fully utilize all of the slots.
- The extra units can be parts other than the ALU (like load units).
- The usefulness will be more apparent when combined with other parallel techniques.

Administrivia
- Put in regrade requests now for any assignment past HW2.
- Final: Thursday 7-10pm @ 10 Evans. NO backpacks, cell phones, calculators, pagers, or PDAs.
- Bring 2 writing implements (we'll provide write-in exam booklets); pencils OK!
- Two pages of notes (both sides, 8.5"x11" paper) and one green sheet allowed.
- Course survey in the last lecture; 2 points for doing it.

Thread Review
- A thread is a single stream of instructions. It has its own registers, PC, etc.
- Threads from the same process operate in the same virtual address space.
- Threads are an easy way to describe and think about parallelism.
- A single CPU can execute many threads by time division multiplexing. [Diagram: over time, the CPU alternates among Thread0, Thread1, and Thread2.]

Multithreading
- Multithreading is running multiple threads through the same hardware.
- Could we do time division multiplexing better in hardware?
- Consider if we gave the OS the abstraction of having 4 physical CPUs that share memory, each executing one thread, but we did it all on 1 physical CPU.
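As a concrete illustration of the thread review above, here is a minimal POSIX threads sketch (not from the lecture; compile with -pthread). It shows two threads of one process sharing the same address space: both update one global counter, so they must synchronize even though each has its own registers and PC.

  /* Two threads, one address space: the global counter is shared. */
  #include <pthread.h>
  #include <stdio.h>

  static long counter = 0;                      /* shared by all threads   */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void *worker(void *arg) {
      for (int i = 0; i < 1000000; i++) {
          pthread_mutex_lock(&lock);            /* each thread has its own */
          counter++;                            /* PC and registers, but   */
          pthread_mutex_unlock(&lock);          /* `counter` is shared     */
      }
      return NULL;
  }

  int main(void) {
      pthread_t t[2];
      for (int i = 0; i < 2; i++)
          pthread_create(&t[i], NULL, worker, NULL);
      for (int i = 0; i < 2; i++)
          pthread_join(t[i], NULL);
      printf("counter = %ld\n", counter);       /* 2000000 with the lock   */
      return 0;
  }

Whether the two threads actually run at the same time (multicore, SMT) or are time-division multiplexed on one CPU is up to the hardware and the OS; the programming model is the same.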

Static Multithreading Example
- Introduced in 1964 by Seymour Cray.
- [Pipeline diagram: four threads are interleaved through the pipeline stages in a fixed rotation, one thread per stage per cycle.]

Static Multithreading Example Analyzed
- Appears to be 4 CPUs, each at 1/4 of the clock rate.
- Results: 4 threads running in hardware; pipeline hazards reduced (no more need to forward, no control issues, fewer structural hazards).
- Depends on being able to generate 4 evenly loaded threads.
- Example: if 1 thread does 75% of the work, Utilization = (% time run)(% work done) = (.25)(.75) + (.75)(.25) = .375 = 37.5%

Dynamic Multithreading
- Adds flexibility in choosing the time to switch threads.
- Simultaneous Multithreading (SMT), called Hyperthreading by Intel: run multiple threads at the same time, and just allocate functional units as they become available. Superscalar hardware helps with this.

Dynamic Multithreading Example
- [Diagram: issue slots for 8 functional units (M, M, FX, FX, FP, FP, BR, CC) over 9 cycles. With one thread many slots sit empty each cycle; with two threads far more of the slots are filled.]

Multicore
- Put multiple CPUs on the same die.
- Why is this better than multiple dies? Smaller, cheaper, closer (so lower inter-processor latency), can share an L2 cache, and less power (power scales roughly with frequency squared).
- Cost of multicore: complexity and slower single-thread execution.

Multicore Cache Coherence
- Two CPUs, two write-through caches, shared DRAM. [Diagram: location 16 in shared main memory initially holds 5; each CPU's cache holds its own copy.]
- Sequence: CPU0: LW R2, 16(R0); CPU1: LW R2, 16(R0); CPU1: SW R0, 16(R0).
- The view of memory is no longer coherent: loads of location 16 from CPU0 and CPU1 see different values (CPU0 still sees 5, CPU1 sees 0)!
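The write-through coherence problem on the last slide can be mimicked with a toy software model. Everything below is an assumed simulation, not real hardware: each "cache" is just a private array with valid bits, writes go through to "DRAM" but never invalidate the other cache, and addresses are treated as simple word indices.

  /* Toy model of two write-through caches with no invalidation protocol. */
  #include <stdio.h>

  static int dram[32];                  /* shared main memory              */
  static int cache0[32], valid0[32];    /* CPU0's private cache            */
  static int cache1[32], valid1[32];    /* CPU1's private cache            */

  static int lw(int *cache, int *valid, int addr) {
      if (!valid[addr]) { cache[addr] = dram[addr]; valid[addr] = 1; }
      return cache[addr];               /* on a hit, memory is never read  */
  }

  static void sw(int *cache, int *valid, int addr, int value) {
      cache[addr] = value; valid[addr] = 1;
      dram[addr] = value;               /* write-through, but no invalidate */
  }

  int main(void) {
      dram[16] = 5;
      printf("CPU0 LW 16 -> %d\n", lw(cache0, valid0, 16));          /* 5 */
      printf("CPU1 LW 16 -> %d\n", lw(cache1, valid1, 16));          /* 5 */
      sw(cache1, valid1, 16, 0);                 /* CPU1: SW R0, 16(R0)    */
      printf("CPU0 LW 16 -> %d (stale!)\n", lw(cache0, valid0, 16)); /* 5 */
      printf("CPU1 LW 16 -> %d\n", lw(cache1, valid1, 16));          /* 0 */
      return 0;
  }

CPU0's second load hits its own stale copy and still returns 5 even though memory and CPU1 now hold 0; a real multicore needs a coherence protocol (e.g., invalidating or updating other caches on a write) to prevent exactly this.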

Multicore Example (IBM Power5)
- [Die diagram: Core #1 and Core #2 with shared resources between them.]

Real World Example 1: Cell Processor
- Multicore, and more.
- 9 cores (1 PPE, 8 SPEs) at 3.2 GHz.
- Power Processing Element (PPE): supervises all activities and allocates work; is multithreaded (2 threads).
- Synergistic Processing Element (SPE): where the work gets done; very superscalar; no cache, only a Local Store.
- Great for other multimedia applications such as HDTV, cameras, etc.
- Really dependent on the programmer using the SPEs and the Local Store well to get the most out of it.

Real World Example 2: Niagara Processor
- Multithreaded and multicore: 32 threads (8 cores, 4 threads each) at 1.2 GHz.
- Designed for low power; has simpler pipelines so more of them fit on the die; maximizes thread-level parallelism. (Sun's Project Blackbox.)
- Each thread runs slower (1.2 GHz) and there is less number-crunching ability (no FP unit), but there are tons of threads.
- This is great for webservers, where there are typically many simple requests and many data stalls.
- Can beat faster and more expensive CPUs while using less power.

Rock: Niagara's Successor
- Released last week, only 20 months later.
- 64 threads (8 cores, 8 threads each).
- 8 FPUs, 8 crypto coprocessors.
- Integrated 10GbE and PCIe hardware.
- Supports 64 logical domains (for 64 virtual OSes).

Peer Instruction
1. The majority of the PS3's processing power comes from the Cell processor.
2. A computer that has max utilization can get more done multithreaded.
3. Current multicore techniques can scale well to many (+) cores.
Answer choices (ABC): 1: FFF  2: FFT  3: FTF  4: FTT  5: TFF  6: TFT  7: TTF  8: TTT

Summary
- Superscalar: more functional units.
- Multithreading: multiple threads executing on the same CPU.
- Multicore: multiple CPUs on the same die.
- The gains from all of these parallel hardware techniques rely heavily on the programmer being able to map their task well onto multiple threads.
- Hit up CS150, CS152, CS162, 194-3, 198-5, and Wikipedia for more info.