EECS150 - Digital Design Lecture 24 - High-Level Design (Part 3) + ECC

EECS150 - Digital Design Lecture 24 - High-Level Design (Part 3) + ECC. April 12, 2012. John Wawrzynek, Spring 2012.

Parallelism

Parallelism is the act of doing more than one thing at a time. Optimization in hardware design often involves using parallelism to trade cost for performance. Example, student final grade calculation:

  read mt1, mt2, mt3, project;
  grade = 0.2*mt1 + 0.2*mt2 + 0.2*mt3 + 0.4*project;
  write grade;

In a high-performance hardware implementation, as many of these operations as possible are done in parallel.
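
As a rough illustration (a software sketch of the dataflow, not the hardware itself; the scores are made-up values), the four multiplies are independent and the adds can form a balanced tree, so the critical path is one multiply followed by two adds rather than seven sequential operations:

  # Sketch: dataflow levels for grade = 0.2*mt1 + 0.2*mt2 + 0.2*mt3 + 0.4*project
  mt1, mt2, mt3, project = 80, 90, 75, 88   # example scores (assumed)

  p1, p2, p3, p4 = 0.2*mt1, 0.2*mt2, 0.2*mt3, 0.4*project  # level 1: 4 multiplies in parallel
  s1, s2 = p1 + p2, p3 + p4                                # level 2: 2 adds in parallel
  grade = s1 + s2                                          # level 3: final add
  print(grade)   # 84.2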

Optimizing Iterative Computations

Types of loops:
1) Looping over input data (streaming), e.g., MP3 player, video compressor, music synthesizer.
2) Looping over memory data, e.g., vector inner product, matrix multiply, list processing.
Types 1 and 2 are really very similar; 1 is often turned into 2 by buffering up input data and processing offline. Even for online processing, buffers are used to smooth out temporary rate mismatches.
3) CPUs are one big loop (instruction fetch, execute; instruction fetch, execute; ...) but change their personality with each iteration.
4) Others?
Loops offer an opportunity for parallelism: execute more than one iteration at once, using parallel iteration execution and/or pipelining.

Pipelining Principle

With looping we are usually less interested in the latency of one iteration and more interested in the loop execution rate, or throughput. These can differ because of parallel iteration execution and/or pipelining. Pipelining review from CS61C, by analogy to washing clothes: step 1: wash (20 minutes), step 2: dry (20 minutes), step 3: fold (20 minutes). Done serially, 60 minutes x 4 loads = 4 hours. Overlapped, the loads move through the three steps assembly-line style:

  20-min slot:  1      2      3      4      5      6
  wash:         load1  load2  load3  load4
  dry:                 load1  load2  load3  load4
  fold:                       load1  load2  load3  load4

so 4 loads finish in 6 slots x 20 min = 2 hours.
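
As a quick check on the arithmetic (a sketch; only the laundry numbers come from the slide): with S equal stages of t minutes each, the k-th of L loads finishes at (S + k - 1)*t, so L loads take (S + L - 1)*t overlapped versus S*t*L done serially.

  # Sketch: pipelined vs. serial completion time for the laundry example.
  def pipeline_minutes(stages, stage_min, loads):
      # the last load completes after (stages + loads - 1) stage-times
      return (stages + loads - 1) * stage_min

  print(pipeline_minutes(3, 20, 4))   # 120 min = 2 hours, overlapped
  print(3 * 20 * 4)                   # 240 min = 4 hours, serial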

Limits on Pipelining

Without FF overhead, throughput improvement is proportional to the number of stages. After many stages are added, FF overhead begins to dominate. (FF overhead is the setup and clock-to-Q times.) Other limiters to effective pipelining:
- clock skew contributes to clock overhead
- unequal stages
- FFs dominate cost
- clock distribution power consumption
- feedback (dependencies between loop iterations)
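
A simple model (a sketch with illustrative timing numbers, not figures from the lecture) makes the diminishing returns concrete: splitting a total combinational delay T_comb into N stages gives a clock period of T_comb/N + t_ff, so throughput can never exceed 1/t_ff.

  # Sketch: throughput vs. pipeline depth with flip-flop overhead.
  t_comb = 10.0   # total combinational delay, ns (assumed)
  t_ff   = 0.5    # setup + clk-to-Q per stage, ns (assumed)

  for n in (1, 2, 4, 8, 16, 32):
      period = t_comb / n + t_ff
      print(f"{n:2d} stages: period {period:5.2f} ns, "
            f"throughput {1/period:.2f} results/ns")
  # Throughput approaches 1/t_ff = 2 results/ns but never reaches it.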

Pipelining Example

F(x): y_i = a*x_i^2 + b*x_i + c. In the computation graph, x and y are assumed to be streams. Divide the graph into 3 (nearly) equal stages and insert pipeline registers at the stage boundaries (the dashed lines in the figure). Can we pipeline the basic operators themselves?
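
Before taking up that question, here is a cycle-by-cycle sketch of one possible 3-stage split (the exact stage assignment is an assumption; the slide only asks for nearly equal stages): stage 1 squares x, stage 2 does both multiplies, stage 3 does the final adds.

  # Sketch: 3-stage pipeline for y = a*x^2 + b*x + c on a stream of x values.
  a, b, c = 2, 3, 1                   # assumed coefficients
  xs = [0, 1, 2, 3, 4]
  r1 = r2 = None                      # pipeline registers between stages
  for x in xs + [None, None]:         # two bubbles to flush the pipe
      # Stages are evaluated back-to-front to mimic simultaneous clocking.
      if r2 is not None:              # stage 3: final add
          ax2, bx = r2
          print("y =", ax2 + bx + c)  # prints 1, 6, 15, 28, 45
      r2 = (a * r1[0], b * r1[1]) if r1 is not None else None   # stage 2
      r1 = (x * x, x) if x is not None else None                # stage 1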

Possible, but usually not done: arithmetic units can often be made sufficiently fast without internal pipelining. Example (figure): a pipelined adder.
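
For the curious, here is one way such an internally pipelined adder could be staged (a sketch under assumed parameters: an 8-bit add split into low and high nibbles, with the carry registered between stages):

  # Sketch: 2-stage pipelined 8-bit adder; low nibble in stage 1,
  # high nibble plus the registered carry in stage 2.
  def add8_pipelined(pairs):
      reg = None                    # (low_sum, carry, a_hi, b_hi)
      out = []
      for a, b in pairs + [(None, None)]:    # one bubble to flush
          if reg is not None:
              lo, cin, ah, bh = reg
              hi = ah + bh + cin
              out.append(((hi & 0xF) << 4) | lo)   # mod-256 result
          if a is not None:
              s = (a & 0xF) + (b & 0xF)
              reg = (s & 0xF, s >> 4, a >> 4, b >> 4)
          else:
              reg = None
      return out

  print(add8_pipelined([(200, 100), (15, 1)]))   # [44, 16] (mod 256)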

Pipelining Loops with Feedback

Example 1: y_i = y_{i-1} + x_i + a. There is a loop-carry dependency: each iteration needs the previous result y_{i-1}. Unpipelined version:

  cycle:  1                2
  add1:   x_i + y_{i-1}    x_{i+1} + y_i
  add2:   y_i              y_{i+1}

Can we cut the feedback and overlap iterations? Try putting a register after add1: we still can't overlap the iterations because of the dependency, and the extra register doesn't help the situation (it actually hurts). In general, we can't pipeline across feedback loops.

Pipelining Loops with Feedback

However, we can overlap the non-feedback part of the iterations. Addition is associative and commutative, so we can reorder the computation to shorten the delay of the feedback path:

  y_i = (y_{i-1} + x_i) + a = (a + x_i) + y_{i-1}

Now add1 computes a + x_i off the feedback path, and only add2 remains in the loop:

  cycle:  1         2            3
  add1:   x_i + a   x_{i+1} + a  x_{i+2} + a
  add2:   y_i       y_{i+1}      y_{i+2}

Pipelining is limited to 2 stages.
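
A cycle-level sketch of the reassociated loop (an illustration with assumed values; one register after add1, with add2 alone on the feedback path):

  # Sketch: y_i = (a + x_i) + y_{i-1}, register between add1 and add2.
  a, y = 5, 0                 # constant and seed y_0 = 0 (assumed)
  xs = [1, 2, 3, 4]
  reg = None                  # pipeline register after add1
  ys = []
  for x in xs + [None]:       # one bubble to flush
      if reg is not None:
          y = reg + y         # add2: the only operation in the loop
          ys.append(y)
      reg = (a + x) if x is not None else None   # add1: feedback-free
  print(ys)   # [6, 13, 21, 30]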

Pipelining Loops with Feedback

Example 2: y_i = a*y_{i-1} + x_i + b. Reorder to shorten the feedback loop and try putting a register after the multiply:

  add1:   x_i + b      x_{i+1} + b   x_{i+2} + b
  mult:   a*y_{i-1}    a*y_i         a*y_{i+1}
  add2:   y_i          y_{i+1}       y_{i+2}

We still need 2 cycles per iteration: add2 must wait for the registered product, and the next multiply can't start until y_i exists.

Pipelining Loops with Feedback

Example 2 continued: y_i = a*y_{i-1} + x_i + b. Once again, adding a register doesn't help. The best solution is to overlap the non-feedback part (x_i + b) with the feedback part (multiply, then add):

  add1:   x_i + b      x_{i+1} + b   x_{i+2} + b
  mult:   a*y_{i-1}    a*y_i         a*y_{i+1}
  add2:   y_i          y_{i+1}       y_{i+2}

The critical path now includes a multiply in series with an add, but the first add overlaps with the multiply/add operation, giving 1 cycle per iteration: a higher-performance solution than the 2-cycle version. An alternative is to move the register to after the multiply, but the critical path is the same.
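
A cycle-level sketch of the 1-cycle-per-iteration schedule (an illustration with assumed coefficients; add1 is registered off the feedback path, while the multiply and add2 both evaluate within the feedback cycle):

  # Sketch: y_i = a*y_{i-1} + (x_i + b), with x_i + b computed a cycle ahead.
  a, b, y = 2, 1, 0           # coefficients and seed y_0 = 0 (assumed)
  xs = [1, 2, 3]
  reg = None                  # register after add1
  ys = []
  for x in xs + [None]:       # one bubble to flush
      if reg is not None:
          y = a * y + reg     # multiply in series with add2: critical path
          ys.append(y)
      reg = (x + b) if x is not None else None   # add1, off the loop
  print(ys)   # [2, 7, 18]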

C-slow Technique

Another approach to increasing throughput in the presence of feedback: try to fill the holes in the schedule with another (independent) computation:

  add1:   x_i + b      x_{i+1} + b   x_{i+2} + b
  mult:   a*y_{i-1}    a*y_i         a*y_{i+1}
  add2:   y_i          y_{i+1}       y_{i+2}

If we have a second similar computation, we can interleave it with the first:

  F1: y1_i = a1*y1_{i-1} + x1_i + b1
  F2: y2_i = a2*y2_{i-1} + x2_i + b2

Use muxes to direct each stream, time-multiplexing one piece of hardware for both streams. Each stream produces 1 result every 2 cycles. Here the feedback depth is 2 cycles (we say C = 2): each loop has a throughput of Fclk/C, but the aggregate throughput is Fclk. With this technique we could pipeline even deeper, assuming we could supply C independent streams.

C-slow Technique

Essentially this means we go ahead and cut the feedback path. Doing so makes operations in adjacent pipeline stages independent and allows a full cycle for each:

  add1:   x+b   x+b   x+b   x+b   x+b   x+b
  mult:   a*y   a*y   a*y   a*y   a*y   a*y
  add2:   y     y     y     y     y     y

C computations (in this case C = 2) can use the pipeline simultaneously, but they must be independent. An input mux interleaves the input streams; each stream runs at half the pipeline frequency, while the pipeline as a whole achieves full throughput. Multithreaded processors use this technique.
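
An interleaving sketch for C = 2 (illustrative coefficients and inputs; the mux alternates the two independent streams through the same multiply/add hardware, so each stream sees a 2-cycle feedback loop while the datapath stays busy every cycle):

  # Sketch: C-slow with C = 2. Two independent recurrences share one datapath.
  params = [(2, 1), (3, 0)]           # (a, b) for streams F1, F2 (assumed)
  xs     = [[1, 2, 3], [4, 5, 6]]     # input streams x1, x2 (assumed)
  y      = [0, 0]                     # one feedback register per stream
  outs   = [[], []]

  for cycle in range(2 * len(xs[0])):
      s = cycle % 2                   # input mux: alternate streams
      a, b = params[s]
      x = xs[s][cycle // 2]
      y[s] = a * y[s] + x + b         # same hardware, time-multiplexed
      outs[s].append(y[s])

  print(outs)   # [[2, 7, 18], [4, 17, 57]]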

Beyond Pipelining - SIMD Parallelism

An obvious way to exploit more parallelism from loops is to make multiple instances of the loop-execution datapath and run them in parallel, sharing the same controller. For P instances, throughput improves by a factor of P. Example: y_i = f(x_i), with four copies of f computing y_i, y_{i+1}, y_{i+2}, y_{i+3} from x_i, x_{i+1}, x_{i+2}, x_{i+3} at once. This is usually called SIMD parallelism: Single Instruction, Multiple Data. It assumes the next 4 x values are available at once; the validity of this assumption depends on the ratio of f's repeat rate to the input rate (or memory bandwidth). Cost is proportional to P, usually much higher than for pipelining, but it potentially provides a high speedup, and it is often applied after pipelining. It is limited, once again, by loop-carry dependencies: feedback translates to dependencies between the parallel datapaths. Vector processors use this technique.
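
A lane-style sketch (an illustration; P = 4 copies of an arbitrary feedback-free f, with the assumption that P inputs arrive together):

  # Sketch: SIMD with P = 4 datapath lanes all computing the same f.
  P = 4

  def f(x):
      return 3 * x + 1        # placeholder function (assumption)

  xs = list(range(8))
  ys = []
  for i in range(0, len(xs), P):          # one step = P results at once
      lane_in = xs[i:i + P]               # assumes P inputs ready together
      ys.extend(f(x) for x in lane_in)    # in hardware: P lanes in parallel
  print(ys)   # [1, 4, 7, 10, 13, 16, 19, 22]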

SIMD Parallelism with Feedback

Example, from earlier: y_i = a*y_{i-1} + x_i + b. In this example we end up with a carry-ripple situation across the parallel datapaths. We could employ look-ahead / parallel-prefix optimization techniques to speed up the propagation. As with pipelining, this technique is most effective in the absence of a loop-carry dependence.
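
To see what the look-ahead technique buys here, substitute the recurrence into itself (a derivation added for illustration; it is not on the slide). Unrolling one step expresses y_{i+1} directly in terms of y_{i-1}:

  y_{i+1} = a*y_i + x_{i+1} + b
          = a*(a*y_{i-1} + x_i + b) + x_{i+1} + b
          = a^2*y_{i-1} + a*x_i + x_{i+1} + (a + 1)*b

Since a^2 and (a + 1)*b are constants that can be precomputed, two datapaths can produce y_i and y_{i+1} per step: the feedback still ripples, but only once per pair of results.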

Error Correction Codes

Error Correction Codes (ECC)

Memory systems generate errors (accidentally flipped bits): DRAMs store very little charge per bit, and soft errors occur occasionally when cells are struck by alpha particles or other environmental upsets. Less frequently, hard errors can occur when chips permanently fail. Where perfect memory is required (servers, spacecraft/military computers), memories are protected against failures with ECCs. Extra bits are added to each data word and used to detect and/or correct faults in the memory system. In general, each possible data-word value is mapped to a unique code word; a fault changes a valid code word to an invalid one, which can be detected.

Simple Error Detection Coding: Parity Bit

Each data value, before it is written to memory, is tagged with an extra bit to force the stored word to have even parity:

  b7 b6 b5 b4 b3 b2 b1 b0 p

Each word, as it is read from memory, is checked by computing its parity (including the parity bit). A non-zero parity indicates that an error occurred. Two errors (on different bits) go undetected, as does any even number of errors; odd numbers of errors are detected.
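
A bit-level sketch of even-parity tagging and checking for 8-bit words (an illustration; the helper names are mine):

  # Sketch: even parity over an 8-bit data word.
  def parity(bits):
      return bin(bits).count("1") & 1      # XOR of all the bits

  def write_word(data8):
      return (data8 << 1) | parity(data8)  # append p: total parity even

  def check_word(word9):
      return parity(word9) == 0            # False means an error was detected

  w = write_word(0b10110010)
  print(check_word(w))            # True: no error
  print(check_word(w ^ 0b100))    # False: single flipped bit detected
  print(check_word(w ^ 0b110))    # True: a double error goes undetected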

Hamming Error Correcting Code

Use more parity bits to pinpoint the bit(s) in error so they can be corrected. Example: single error correction (SEC) on 4-bit data. Using 3 parity bits with 4 data bits results in a 7-bit code word; 3 parity bits are sufficient to identify any one of the 7 code-word bits. Overlap the assignment of parity bits so that a single error anywhere in the 7-bit word can be corrected. Procedure: group the parity bits so they correspond to subsets of the 7 bit positions (note: bits are numbered from left to right, starting at 1):

  position:  1   2   3   4   5   6   7
  contents:  p1  p2  d1  p3  d2  d3  d4

  p1 protects positions 1, 3, 5, 7 (binary position numbers 001, 011, 101, 111)
  p2 protects positions 2, 3, 6, 7 (binary position numbers 010, 011, 110, 111)
  p3 protects positions 4, 5, 6, 7 (binary position numbers 100, 101, 110, 111)

Hamming Code Example

  position:  1   2   3   4   5   6   7
  contents:  p1  p2  d1  p3  d2  d3  d4

Note: the parity bits occupy the power-of-two bit positions in the code word. On writing to memory, the parity bits are assigned to force even parity over their respective groups. On reading from memory, check bits (c3, c2, c1) are generated by finding the parity of each group together with its parity bit: if an error occurred in a group, the corresponding check bit will be 1; if not, it will be 0. Read as a binary number, the check bits (c3, c2, c1) give the position of the bit in error. Example: c = c3c2c1 = 101. c3 = 1: error in position 4, 5, 6, or 7. c1 = 1: error in position 1, 3, 5, or 7. c2 = 0: no error in position 2, 3, 6, or 7. Therefore the error must be in bit 5, and the check bits point right at it. By the clever positioning and assignment of the parity bits, the check bits always address the position of the error; c = 000 indicates no error.
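
The scheme above translates almost directly into a few lines of code (a sketch; list indices follow the 1-based bit positions from the slide, and even parity is used throughout):

  # Sketch: Hamming (7,4) single error correction, positions 1..7.
  GROUPS = {1: (3, 5, 7), 2: (3, 6, 7), 4: (5, 6, 7)}   # parity pos -> data pos

  def encode(d1, d2, d3, d4):
      w = {3: d1, 5: d2, 6: d3, 7: d4}
      for p, grp in GROUPS.items():
          w[p] = sum(w[i] for i in grp) & 1      # force even parity per group
      return [w[i] for i in range(1, 8)]

  def decode(word):
      syndrome = 0
      for p, grp in GROUPS.items():
          if (word[p - 1] + sum(word[i - 1] for i in grp)) & 1:
              syndrome += p                      # check bits form the position
      word = word[:]
      if syndrome:
          word[syndrome - 1] ^= 1                # correct the single error
      return [word[i - 1] for i in (3, 5, 6, 7)], syndrome

  cw = encode(1, 0, 1, 1)
  cw[4] ^= 1                    # flip the bit at position 5
  print(decode(cw))             # ([1, 0, 1, 1], 5): data recovered, error at 5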

Hamming Error Correcting Code

Overhead involved in a single-error-correction code: let p be the total number of parity bits and d the number of data bits in a (p + d)-bit word. If the p error-correction bits are to point to the error bit (p + d cases) plus indicate that no error exists (1 case), we need:

  2^p >= p + d + 1, thus p >= log2(p + d + 1)

For large d, p approaches log2(d). Adding one extra parity bit covering the entire word provides double error detection as well:

  position:  1   2   3   4   5   6   7   8
  contents:  p1  p2  d1  p3  d2  d3  d4  p4

On reading, the check bits C are computed as usual, plus the parity P over the entire word:
  C = 0,  P = 0: no error
  C != 0, P = 1: correctable single error
  C != 0, P = 0: a double error occurred (detected but not correctable)
  C = 0,  P = 1: an error occurred in the p4 bit
Typical modern codes in DRAM memory systems use 64-bit data blocks (8 bytes) with 72-bit code words (9 bytes), giving single error correction and double error detection (SEC/DED).
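
A quick check of the bound (a sketch): the smallest p satisfying 2^p >= p + d + 1 for common data widths, which reproduces the 64-data/72-total DRAM code once the extra double-error-detection bit is counted:

  # Sketch: minimum parity bits for SEC, from 2^p >= p + d + 1.
  def min_parity_bits(d):
      p = 1
      while 2 ** p < p + d + 1:
          p += 1
      return p

  for d in (4, 8, 16, 32, 64):
      p = min_parity_bits(d)
      print(f"d={d:2d}: p={p} for SEC; {d + p + 1} bits total with DED parity")
  # d=64 -> p=7, so 64 + 7 + 1 = 72 bits: the standard DRAM SEC/DED code.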

Error Correction with LFSRs

Error Correction with LFSRs

XOR Q4 with the incoming bit sequence. Now the values of the shift register don't follow a fixed pattern; they depend on the input sequence. Look at the value of the register after 15 cycles: 1010. Note that the length of the input sequence is 2^4 - 1 = 15 (the same as the number of different non-zero patterns for the original LFSR). The binary message occupies only 11 bits; the remaining 4 bits are 0000, and they are replaced by the final result of our LFSR, 1010. If we run the sequence back through the LFSR with the replaced bits, we get 0000 as the final result; the 4 parity bits neutralize the sequence with respect to the LFSR:

  message + 0000:  1 1 0 0 1 0 0 0 1 1 1 0 0 0 0   ->  LFSR result: 1 0 1 0
  message + 1010:  1 1 0 0 1 0 0 0 1 1 1 1 0 1 0   ->  LFSR result: 0 0 0 0

If the parity bits are not all zero, an error occurred in transmission. If the number of parity bits equals log2 of the total number of bits, then single-bit errors can be corrected. Using more parity bits allows more errors to be detected. Ethernet uses 32 parity bits per frame (packet), generated with a 32-bit LFSR.
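
A bit-serial sketch of the idea (the 4-bit feedback polynomial here, x^4 + x^3 + 1, is an assumption; the lecture's actual register taps determine patterns such as the 1010 above, so this sketch produces different parity bits but exhibits the same neutralizing property):

  # Sketch: LFSR-based parity (a CRC). Polynomial x^4 + x^3 + 1 is assumed.
  POLY = 0b11001   # x^4 + x^3 + 1
  N = 4            # number of parity bits

  def lfsr_remainder(bits):
      reg = 0
      for b in bits:
          reg = (reg << 1) | b
          if reg & (1 << N):          # feedback whenever bit 4 pops out
              reg ^= POLY
      return [(reg >> i) & 1 for i in range(N - 1, -1, -1)]

  msg = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1]    # the 11-bit message from the slide
  parity = lfsr_remainder(msg + [0] * N)     # run once with 0000 appended
  sent = msg + parity                        # replace the zeros with the parity
  print(parity)                 # [1, 1, 0, 1] with this polynomial
  print(lfsr_remainder(sent))   # [0, 0, 0, 0]: parity neutralizes the stream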