EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14.


Outline: Parallelism

Parallelism. Parallelism is the act of doing more than one thing at a time. Optimization in hardware design often involves using parallelism to trade cost against performance; parallelism can often also be used to improve energy efficiency. Example: student final grade calculation: read mt1, mt2, mt3, project; grade = 0.2*mt1 + 0.2*mt2 + 0.2*mt3 + 0.4*project; write grade. A high-performance hardware implementation performs as many of these operations as possible in parallel.

A log n lower bound (in time) to compute any function of n variables: assume we can only use binary operations, each taking unit time. After 1 time unit, an output can depend on at most two inputs. By induction, after k time units an output can depend on at most 2^k inputs, so after log2(n) time units an output can depend on at most n inputs. A binary tree of operators performs such a computation. (Demmel, CS267 Lecture 6+)
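
Stated as a worked recurrence (a minimal LaTeX rendering of the slide's induction; the symbol I(k) is my notation, not from the slides):

```latex
% I(k) = maximum number of inputs an output can depend on
% after k time units of unit-delay binary operations.
\begin{align*}
I(0) &= 1, \qquad I(k) \le 2\,I(k-1) \;\Rightarrow\; I(k) \le 2^k, \\
I(k) &\ge n \;\Rightarrow\; k \ge \lceil \log_2 n \rceil .
\end{align*}
```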

Reductions with trees. A balanced binary tree over N inputs computes the reduction in log2(N) levels. If each node (operator) is k-ary instead of binary, what is the delay? (Demmel, CS267 Lecture 6+)

Trees for optimization. Summing x0 .. x7 as a serial chain, ((((((x0 + x1) + x2) + x3) + x4) + x5) + x6) + x7, takes T = O(N). Summing with a balanced tree, ((x0 + x1) + (x2 + x3)) + ((x4 + x5) + (x6 + x7)), takes T = O(log N). What property of + are we exploiting? What other operators are associative: Boolean operations? Division? Min/Max?
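
A minimal sketch of the two orderings in Python (function names are mine, not from the slides). Both compute the same sum because + is associative; the tree version has O(log N) depth since each level's adds are independent and could run in parallel in hardware:

```python
from functools import reduce

def serial_sum(xs):
    # ((x0 + x1) + x2) + ... : O(N) depth
    return reduce(lambda a, b: a + b, xs)

def tree_sum(xs):
    # Pairwise-combine levels: O(log N) depth.
    while len(xs) > 1:
        if len(xs) % 2:                       # odd count: pad with identity
            xs = xs + [0]
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0]

xs = [1, 2, 3, 4, 5, 6, 7, 8]
assert serial_sum(xs) == tree_sum(xs) == 36
```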

Parallelism. Is there a lower-cost hardware implementation? A different tree organization? We can factor out the multiply by 0.2 (e.g., grade = 0.2*(mt1 + mt2 + mt3) + 0.4*project), saving multipliers. How about sharing operators (multipliers and adders)?

Time-multiplexing. Time-multiplex a single ALU for all the adds and multiplies: this attempts to minimize cost at the expense of time, and it requires extra registers, muxes, and control. If we adopt this approach, we can view the combinational hardware circuit diagram as an abstract computation graph. The time-multiplexed ALU covers the computation graph by performing the action of each node one at a time (it sort of emulates the graph). Using other primitives, other coverings are possible.
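
A toy sketch of the idea, using the grade computation from earlier. One shared "ALU" executes the graph's nodes one per cycle from a small schedule; the schedule, register names, and temporaries are my own illustration:

```python
# One shared "ALU" executes the computation graph one node per cycle.
# Schedule entries: (op, dest, src1, src2). t0..t5 are the extra
# temporaries the time-multiplexed design must add.
def alu(op, a, b):
    return a + b if op == "add" else a * b

def run(schedule, regs):
    for cycle, (op, dest, s1, s2) in enumerate(schedule):
        regs[dest] = alu(op, regs[s1], regs[s2])
        print(f"cycle {cycle}: {dest} = {s1} {op} {s2} -> {regs[dest]}")
    return regs

regs = {"mt1": 80, "mt2": 90, "mt3": 70, "proj": 85,
        "c02": 0.2, "c04": 0.4}
schedule = [
    ("mul", "t0", "c02", "mt1"),    # 0.2 * mt1
    ("mul", "t1", "c02", "mt2"),    # 0.2 * mt2
    ("mul", "t2", "c02", "mt3"),    # 0.2 * mt3
    ("mul", "t3", "c04", "proj"),   # 0.4 * project
    ("add", "t4", "t0", "t1"),
    ("add", "t5", "t2", "t3"),
    ("add", "grade", "t4", "t5"),
]
print(run(schedule, regs)["grade"])   # 82.0, in 7 cycles on one ALU
```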

HW versus SW. This time-multiplexed ALU approach is very similar to what a conventional software version would accomplish: CPUs time-multiplex their function units (ALUs, etc.): add r2,r1,r3; add r2,r2,r4; mult r2,r4,r5; ... This model matches our tendency to express computation sequentially, even though most computations naturally contain parallelism, and our programming languages reinforce that sequential tendency. In hardware we have the ability to exploit a problem's parallelism, which gives us a knob to trade off performance against cost. It may be best to express computations as abstract computation graphs (rather than as programs), which should lead to a wider range of implementations. Note: modern processors spend much of their cost budget attempting to restore execution parallelism: superscalar execution.

Example: video codec. Exploiting parallelism in HW: separate algorithm blocks are implemented in separate HW blocks, or the HW is time-multiplexed. The entire operation is pipelined (with possible pipelining within the blocks). Loop unrolling is used within blocks or for the entire computation.

Optimizing iterative computations. Hardware implementations of computations almost always involve looping. Why? Is this true of software? Are there programs without loops? Maybe in throwaway code, but we probably would not bother building such a thing in hardware, would we? (FPGAs may change this.) The fact is, our computations are closely tied to loops, and almost all our HW includes some looping mechanism. What do we use looping for?

Optimizing iterative computations. Types of loops:
1) Looping over input data (streaming): e.g., MP3 player, video compressor, music synthesizer.
2) Looping over memory data: e.g., vector inner product, matrix multiply, list processing.
1) and 2) are really very similar; 1) is often turned into 2) by buffering up input data and processing offline. Even for online processing, buffers are used to smooth out temporary rate mismatches.
3) CPUs are one big loop: instruction fetch, execute; instruction fetch, execute; ... but they change their personality with each iteration.
4) Others?
Loops offer an opportunity for parallelism: execute more than one iteration at once, using parallel iteration execution and/or pipelining.

Pipelining principle. With looping we are usually less interested in the latency of one iteration and more interested in the loop execution rate, or throughput; these can differ because of parallel iteration execution and/or pipelining. Pipelining review (from CS61C), by analogy to washing clothes: step 1: wash (20 minutes); step 2: dry (20 minutes); step 3: fold (20 minutes). Done sequentially, 60 minutes per load x 4 loads = 4 hours. Overlapped (load 2 washes while load 1 dries, and so on), a new load finishes every 20 minutes and the 4 loads take 2 hours.

Pipelining. In the limit, as we increase the number of loads, the average time per load approaches 20 minutes. The latency (time from start to end) for one load is still 60 minutes, but the throughput is 3 loads/hour. In general, pipelined throughput <= (number of pipe stages) x (unpipelined throughput).
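
A quick check of the laundry arithmetic, under the usual idealized pipeline model (the formula is mine, but it matches the slide's numbers):

```python
# Idealized S-stage pipeline: N items take (S + N - 1) stage-times.
def pipelined_time(n_items, n_stages, stage_min):
    return (n_stages + n_items - 1) * stage_min

print(pipelined_time(4, 3, 20))   # 120 min = 2 hours for 4 loads
print(pipelined_time(1, 3, 20))   # 60 min latency for a single load
# As n_items grows, time per item approaches stage_min = 20 minutes.
```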

Pipelining, the general principle. Assume a combinational-logic (CL) block with delay T = 8 ns and FF overhead T_FF (setup + clk-to-q) = 1 ns; then F = 1/(9 ns) = 111 MHz. Cut the CL block into pieces (stages) and separate them with registers. Assuming T1 = T2 = 4 ns, the total latency becomes T = 4 ns + 1 ns + 4 ns + 1 ns = 10 ns, but F = 1/(4 ns + 1 ns) = 200 MHz: the CL block produces a new result every 5 ns instead of every 9 ns.
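
As a general formula (the standard pipelining relation, consistent with the numbers above):

```latex
% Clock frequency of an N-stage pipeline over a CL block of delay T,
% with per-register overhead T_{FF} (setup + clk-to-q):
f = \frac{1}{\tfrac{T}{N} + T_{FF}}
\qquad\text{e.g. } N = 2:\;
f = \frac{1}{4\,\text{ns} + 1\,\text{ns}} = 200\,\text{MHz}
```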

Limits on pipelining. Without FF overhead, throughput improvement is proportional to the number of stages. After many stages are added, FF overhead (the setup and clk-to-q times) begins to dominate. Other limiters to effective pipelining: clock skew contributes to clock overhead; unequal stages; FFs come to dominate cost; clock distribution power consumption; feedback (dependencies between loop iterations).

Pipelining example: F(x): y_i = a*x_i^2 + b*x_i + c. In the computation graph, x and y are assumed to be streams. Divide the graph into 3 (nearly) equal stages and insert pipeline registers at the stage boundaries (the dashed lines in the figure). Can we pipeline the basic operators themselves?

Possible, but usually not done (arithmetic units can often be made sufficiently fast without internal pipelining); when it is done, it is more common to pipeline multipliers. Example: pipelined adder.
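
A behavioral sketch of the 3-stage pipeline for y_i = a*x_i^2 + b*x_i + c (the stage split and variable names are my own; pipeline registers are modeled as Python tuples updated once per "cycle"):

```python
# 3-stage pipeline for y = a*x^2 + b*x + c, simulated cycle by cycle.
# Stage 1: x^2 (pass x along). Stage 2: a*(x^2) and b*x. Stage 3: sum + c.
a, b, c = 2, 3, 5
xs = [0, 1, 2, 3]

r1 = r2 = None                 # pipeline registers, initially empty
ys = []
for x in xs + [0, 0]:          # two bubbles to flush the pipe
    if r2 is not None:
        ys.append(r2[0] + r2[1] + c)      # stage 3: ax^2 + bx + c
    if r1 is not None:
        r2 = (a * r1[0], b * r1[1])       # stage 2: a*(x^2), b*x
    r1 = (x * x, x)                       # stage 1: x^2, pass x
print(ys)   # [5, 10, 19, 32] == [a*x*x + b*x + c for x in xs]
```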

Pipelining loops with feedback. Example 1: y_i = y_{i-1} + x_i + a. In the unpipelined version, add1 computes x_i + y_{i-1} and add2 adds a to produce y_i; this is a loop-carry dependency. Can we cut the feedback and overlap iterations? Try putting a register after add1: we still can't overlap the iterations because of the dependency, and the extra register doesn't help the situation (it actually hurts). In general, we can't effectively pipeline feedback loops directly.

Pipelining loops with feedback. However, we can overlap the non-feedback part of the iterations with the loop-carry dependency. Addition is associative and commutative, so we can reorder the computation to shorten the delay of the feedback path: y_i = (y_{i-1} + x_i) + a = (a + x_i) + y_{i-1}. Now add1 computes x_i + a with no feedback, and only add2, which accumulates y_i, sits on the shortened feedback path. Pipelining is limited to 2 stages.
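
A small sketch of the reassociated loop (variable names are mine). Stage 1 has no feedback, so it works on iteration i+1 while stage 2 finishes iteration i:

```python
# y_i = y_{i-1} + x_i + a, reassociated as y_i = (a + x_i) + y_{i-1}.
# Only stage 2 (add2) carries y between iterations.
a = 10
xs = [1, 2, 3, 4]

y = 0
t = None                      # pipeline register between add1 and add2
ys = []
for x in xs + [0]:            # one bubble to flush
    if t is not None:
        y = t + y             # add2: feedback path, one add per cycle
        ys.append(y)
    t = a + x                 # add1: overlaps the next iteration
print(ys)                     # [11, 23, 36, 50]
```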

Pipelining loops with feedback. Example 2: y_i = a*y_{i-1} + x_i + b. Reorder to shorten the feedback loop and try putting a register after the multiply: add1 computes x_i + b, mult computes a*y_{i-1}, and add2 produces y_i. We still need 2 cycles per iteration.

Pipelining loops with feedback. Example 2, continued: once again, adding a register doesn't help. The best solution is to overlap the non-feedback part (add1: x_i + b) with the feedback part (multiply then add2); the critical path then includes a multiply in series with an add, but we need only 1 cycle per iteration: a higher-performance solution than the 2-cycle version. An alternative is to move the register to after the multiply, but the critical path is the same.
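
The 1-cycle-per-iteration schedule, sketched the same way (add1 runs one cycle ahead; the multiply and add2 stay together on the feedback path):

```python
# y_i = a*y_{i-1} + x_i + b. Precompute t_i = x_i + b one cycle early;
# each cycle then does y_i = a*y_{i-1} + t_i (multiply in series with add2).
a, b = 2, 3
xs = [1, 2, 3]

y = 0
t = None
ys = []
for x in xs + [0]:            # one bubble to flush
    if t is not None:
        y = a * y + t         # feedback path: mult + add2, one cycle
        ys.append(y)
    t = x + b                 # add1 overlaps the feedback path
print(ys)                     # [4, 13, 32]
```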

C-slow technique. Another approach to increasing throughput in the presence of feedback: try to fill in the holes in the schedule with another (independent) computation. If we have a second similar computation, we can interleave it with the first: y1_i = a1*y1_{i-1} + x1_i + b1 and y2_i = a2*y2_{i-1} + x2_i + b2. Use muxes to direct each stream, time-multiplexing one piece of HW for both streams; each produces 1 result every 2 cycles. Here the feedback depth is 2 cycles (we say C = 2): each loop has a throughput of Fclk/C, but the aggregate throughput is Fclk. With this technique we could pipeline even deeper, assuming we could supply C independent streams.

C-slow technique. Essentially this means we go ahead and cut the feedback path; this makes the operations in adjacent pipeline stages independent and allows a full cycle for each. C computations (in this case C = 2) can use the pipeline simultaneously, as long as they are independent. An input mux interleaves the input streams; each stream runs at half the pipeline frequency, while the pipeline achieves full throughput. Multithreaded processors use this technique.
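
A behavioral sketch of C = 2 interleaving for the recurrence above (the stream parameters are mine, and the model abstracts the pipeline to alternating mux selections): each stream only ever sees its own y from two cycles earlier, so the streams never interact:

```python
# C-slow (C = 2): interleave two independent recurrences
# y_i = a*y_{i-1} + x_i + b through one time-multiplexed datapath.
streams = [
    {"a": 2, "b": 3, "y": 0, "xs": [1, 2, 3]},   # stream 1
    {"a": 5, "b": 1, "y": 0, "xs": [4, 5, 6]},   # stream 2
]

results = [[], []]
for i in range(3):                   # iteration index within each stream
    for c, s in enumerate(streams):  # input mux alternates streams per cycle
        s["y"] = s["a"] * s["y"] + s["xs"][i] + s["b"]
        results[c].append(s["y"])
print(results)   # [[4, 13, 32], [5, 31, 162]]
```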

Beyond pipelining - SIMD parallelism. An obvious way to exploit more parallelism from loops is to make multiple instances of the loop-execution datapath and run them in parallel, sharing the same controller. For P instances, throughput improves by a factor of P. Example: y_i = f(x_i), with four copies of f computing y_i .. y_{i+3} from x_i .. x_{i+3}. This is usually called SIMD parallelism: Single Instruction, Multiple Data. It assumes the next 4 x values are available at once; the validity of this assumption depends on the ratio of the repeat rate of f to the input rate (or memory bandwidth). Cost is proportional to P, usually much higher than for pipelining, but it potentially provides a high speedup and is often applied after pipelining. Vector processors use this technique. It is limited, once again, by loop-carry dependencies: feedback translates into dependencies between the parallel datapaths.
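
A minimal sketch of P-way SIMD for a dependence-free loop (P and f here are illustrative, not from the slides):

```python
# SIMD: P copies of the datapath f apply one "instruction" to P data items.
P = 4

def f(x):
    return 3 * x + 1

xs = list(range(8))
ys = []
for i in range(0, len(xs), P):          # one control step per group of P
    ys.extend(f(x) for x in xs[i:i+P])  # the P applications are independent
print(ys)   # same as [f(x) for x in xs], but P results per control step
```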

SIMD parallelism with feedback. Example, from earlier: y_i = a*y_{i-1} + x_i + b. As with pipelining, this technique is most effective in the absence of a loop-carry dependence. With a loop-carry dependence we end up with a carry-ripple situation, but we could employ look-ahead / parallel-prefix optimization techniques to speed up the propagation.
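
One way to see the parallel-prefix idea, sketched in Python (the affine-map composition is a standard trick, not from the slides). Each iteration is the affine map y -> a*y + (x_i + b); composition of affine maps is associative, so the prefix compositions can be computed with a tree, just like the earlier reductions:

```python
from itertools import accumulate

# y_i = a*y_{i-1} + x_i + b as a prefix of affine maps m_i(y) = a*y + (x_i + b).
# Affine maps (p, q), meaning y -> p*y + q, compose associatively:
def compose(m2, m1):                 # apply m1 first, then m2
    p2, q2 = m2
    p1, q1 = m1
    return (p2 * p1, p2 * q1 + q2)

a, b = 2, 3
xs = [1, 2, 3]
maps = [(a, x + b) for x in xs]

# accumulate() is a serial stand-in for a parallel-prefix (scan) tree;
# because compose() is associative, hardware could do this in O(log n) depth.
prefixes = accumulate(maps, lambda acc, m: compose(m, acc))
y0 = 0
print([p * y0 + q for p, q in prefixes])   # [4, 13, 32], matches the recurrence
```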