
inst.eecs.berkeley.edu/~cs61c
UC Berkeley CS61C : Machine Structures
Lecture 40: Hardware Parallel Computing
2006-12-06
Head TA Scott Beamer, inst.eecs./~cs61c-tb
Thanks to John Lazzaro for his CS152 slides: inst.eecs.berkeley.edu/~cs152/

Spam Emails on the Rise: new studies show spam emails have increased by 50% worldwide. Worse yet, much of the spamming is being done by criminals, so laws may not help.
http://www.cnn.com/2006/world/europe/11/27/uk.spam.reut/index.html

Outline
Last time was about how to exploit parallelism from the software point of view; today is about how to implement it in hardware.
Some parallel hardware techniques.
A couple of current examples.
We won't cover out-of-order execution, since it is too complicated.

Introduction
Given many threads (somehow generated by software), how do we execute them in hardware?
Recall the performance equation: Execution Time = (Instruction Count) x (CPI) x (Cycle Time)
How hardware parallelism affects each term:
- Instruction Count: if the equation is applied to each CPU, each CPU needs to execute fewer instructions.
- CPI: if the equation is applied to the system as a whole, more is done per cycle.
- Cycle Time: will probably be made worse in the process.
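As a worked instance of the equation, here is a minimal sketch; the instruction count, CPI, cycle time, and 4-core split below are assumed numbers for illustration, not figures from the lecture.

```latex
% Performance equation from the slide (assumes \usepackage{amsmath}):
\[
T_{\mathrm{exec}} = \mathrm{IC} \times \mathrm{CPI} \times T_{\mathrm{cycle}}
\]
% Assumed example: a 10^9-instruction program with CPI = 1 and a 1 ns cycle
% takes 1 s on one core. Split perfectly across 4 cores, each core's IC drops
% to 2.5 x 10^8, so (ignoring communication overhead):
\[
T_{\mathrm{per\ core}} = \frac{10^9}{4} \times 1 \times 1\,\mathrm{ns} = 0.25\,\mathrm{s}
\]
```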

Disclaimers
Please don't let today's material confuse what you have already learned about CPUs and pipelining.
When "the programmer" is mentioned today, it means whoever is generating the assembly code (so it is probably a compiler).
Many of the concepts described today are difficult to implement, so if something sounds easy, think of the possible hazards.

Superscalar
Add more functional units or pipelines to the CPU.
Directly reduces CPI by doing more work per cycle.
Consider what happens if we:
- add another ALU,
- add 2 more read ports to the RegFile,
- add 1 more write port to the RegFile.

Simple Superscalar MIPS CPU
[Datapath diagram: the instruction memory fetches two instructions (Inst0, Inst1) per cycle; the register file has four read ports (Ra, Rb, Rc, Rd) and two write ports (W0, W1); two ALUs operate in parallel and share the data memory.]
Can now do 2 instructions in 1 cycle!

Simple Superscalar MIPS CPU (cont.)
Considerations: the ISA now has to be changed, and forwarding for pipelining is now harder.
Limitations: the programmer must explicitly generate parallel code, there is an improvement only if independent instructions can fill the extra slots (see the example below), and the approach doesn't scale well.
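To make the "fill the slots" limitation concrete, here is a small hypothetical C fragment (the variables and values are made up for illustration): a 2-wide superscalar machine could issue the first two additions in the same cycle, but the dependent addition must wait, leaving a slot empty.

```c
#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c = 3, d = 4;

    /* Independent operations: neither result feeds the other, so a
       dual-issue (2-wide superscalar) CPU could execute both
       additions in the same cycle. */
    int x = a + b;
    int y = c + d;

    /* Dependent operation: this add needs x, so it cannot issue in
       the same cycle as the add that produces x; one issue slot
       goes unused. */
    int z = x + 1;

    printf("%d %d %d\n", x, y, z);
    return 0;
}
```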

Single Instruction Multiple Data (SIMD)
Often done in vector form, so the same operation is applied to all of the data.
Example: AltiVec (similar to SSE). Its 128-bit registers can hold 4 floats, 4 ints, 8 shorts, 16 chars, etc.
The whole vector is processed at once.
[Diagram: two 128-bit operands A and B feed a 128-bit-wide ALU.]
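A minimal sketch of the SIMD idea using x86 SSE intrinsics rather than AltiVec (the slide notes the two are similar); the array contents are arbitrary example data, not from the lecture.

```c
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics: a 128-bit __m128 holds 4 floats */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    /* One SIMD add instruction operates on all four float lanes at once. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);
    _mm_storeu_ps(c, vc);

    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```

The same effect can also come from a compiler auto-vectorizing a plain loop, which is closer to what "the programmer" (i.e., the compiler) does in practice.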

Superscalar in Practice
ISAs have extensions for these vector operations.
This is one thread that has parallelism internally.
The performance improvement depends on the program and the programmer being able to fully utilize all the slots.
The extra units can be parts other than the ALU (such as a load unit).
The usefulness will be more apparent when combined with other parallel techniques.

Thread Review
A thread is a single stream of instructions. It has its own registers, PC, etc.
Threads from the same process operate in the same virtual address space.
Threads are an easy way to describe and think about parallelism.
A single CPU can execute many threads by time-division multiplexing.
[Diagram: CPU time divided into slices, rotating among Thread0, Thread1, and Thread2.]
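A minimal sketch of threads sharing one virtual address space, using POSIX threads; the thread count and the shared counter are arbitrary choices for the example.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static int shared_counter = 0;  /* one copy, visible to every thread */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread has its own registers, PC, and stack, but all of them
   read and write the same shared_counter in the shared address space. */
static void *worker(void *arg) {
    long id = (long)arg;
    pthread_mutex_lock(&lock);
    shared_counter++;
    printf("thread %ld saw counter = %d\n", id, shared_counter);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    printf("final counter = %d\n", shared_counter);  /* all 4 updates land in the same memory */
    return 0;
}
```

Compile with gcc -pthread. On a single CPU the OS time-division multiplexes these threads; the hardware techniques on the next slides run them concurrently instead.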

Multithreading
Multithreading is running multiple threads through the same hardware.
Could we do time-division multiplexing better in hardware?
Consider giving the OS the abstraction of 4 physical CPUs that share memory and each execute one thread, while actually doing it all on 1 physical CPU.

Static Multithreading Example
Appears to be 4 CPUs, each at 1/4 of the clock rate.
Introduced in 1964 by Seymour Cray.
[Diagram: ALU and pipeline stages shared in rotation among the four thread slots.]

Static Multithreading Example Analyzed
Results: 4 threads running in hardware.
Pipeline hazards are reduced: no more need to forward, no control issues, and fewer structural hazards.
Performance depends on being able to keep 4 threads evenly busy.
Example: if 1 thread does 75% of the work,
Utilization = (% time run)(% work done) = (0.25)(0.75) + (0.75)(0.25) = 0.375 = 37.5% (worked out below).
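Here is the slide's utilization arithmetic written out; interpreting the two products as the busy thread's slot and the other three slots is our reading of the slide, not something it states explicitly.

```latex
% Round-robin gives each of the 4 thread slots 25% of the cycles.
% One thread holds 75% of the work; the other three hold the remaining 25%.
% (assumes \usepackage{amsmath})
\[
\mathrm{Utilization}
  = \underbrace{(0.25)(0.75)}_{\text{busy thread's slot}}
  + \underbrace{(0.75)(0.25)}_{\text{other three slots}}
  = 0.1875 + 0.1875 = 0.375 = 37.5\%
\]
```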

Dynamic Multithreading
Adds flexibility in choosing when to switch threads.
Simultaneous Multithreading (SMT), called Hyper-Threading by Intel:
run multiple threads at the same time, allocating functional units to whichever thread can use them.
Being superscalar helps with this.

Dynamic Multithreading Example
[Diagram: issue-slot utilization over cycles 1 through 9 for 8 functional units (2 memory, 2 fixed-point, 2 floating-point, branch, condition code). The figure contrasts one thread on 8 units with two threads sharing the same 8 units, showing how many more slots are filled with two threads.]

Multicore
Put multiple CPUs on the same die.
Why is this better than multiple dies? Smaller, cheaper, closer together (so lower inter-processor latency), the cores can share an L2 cache, and it uses less power.
Cost of multicore: complexity and slower single-thread execution.

Multicore Example (IBM Power5)
[Die diagram: Core #1 and Core #2 with shared resources between them.]

Administrivia
Proj4 is due tonight at 11:59pm.
Proj1: check the newsgroup for the posting about Proj1 regrades; you may want one.
Lab tomorrow will have 2 surveys.
Come to class Friday for the HKN course survey.

Upcoming Calendar
Week #15 (this week): Mon, Parallel Computing in Software; Wed, Parallel Computing in Hardware (Scott); Thu lab, I/O Networking & 61C Feedback Survey; Fri, LAST CLASS: Summary, Review, & HKN Evals.
Week #16: Sun 2pm, Review, 10 Evans; FINAL EXAM Thu 12-14 @ 12:30pm-3:30pm, 234 Hearst Gym.
Final exam: same rules as the midterm, except you get 2 double-sided handwritten review sheets (1 from your midterm, 1 new one) plus the green sheet. [Don't bring backpacks.]

Real World Example 1: Cell Processor
Multicore, and more.

Real World Example 1: Cell Processor
9 cores (1 PPE, 8 SPEs) at 3.2GHz.
Power Processing Element (PPE): supervises all activities and allocates work; is multithreaded (2 threads).
Synergistic Processing Element (SPE): where the work gets done; very superscalar; no cache, only a Local Store.

Real World Example 1: Cell Processor
Great for other multimedia applications such as HDTV, cameras, etc.
Really dependent on the programmer: you must use the SPEs and the Local Store to get the most out of it.

Real World Example 2: Niagara Processor
Multithreaded and multicore: 32 threads (8 cores, 4 threads each) at 1.2GHz.
Designed for low power.
Has simpler pipelines so more cores fit on the die.
Maximizes thread-level parallelism.
[Photo: Project Blackbox.]

Real World Example 2: Niagara Processor
Each thread runs slower (1.2GHz), and there is less number-crunching ability (no FP unit), but there are tons of threads.
This is great for web servers, where there are typically many simple requests and many data stalls.
It can beat faster and more expensive CPUs while using less power.

Peer Instruction
1. The majority of the PS3's processing power comes from the Cell processor.
2. A computer that is already at max utilization can get more done by multithreading.
3. Current multicore techniques can scale well to many (32+) cores.
Answer choices (ABC): 1: FFF, 2: FFT, 3: FTF, 4: FTT, 5: TFF, 6: TFT, 7: TTF, 8: TTT.

Peer Instruction Answer
1. FALSE: the whole PS3 is 2.18 TFLOPS, while the Cell is only 204 GFLOPS (the GPU can do a lot).
2. FALSE: there is no more functional power left to exploit.
3. FALSE: sharing memory and caches is a huge barrier; this is why the Cell has a Local Store.

Summary
Superscalar: more functional units.
Multithreading: multiple threads executing on the same CPU.
Multicore: multiple CPUs on the same die.
The gains from all of these parallel hardware techniques rely heavily on the programmer being able to map the task well onto multiple threads.
Hit up CS150, CS152, CS162, and Wikipedia for more info.