Computer Science 252 Quiz #2
1. [10 points] Your task is to design an I/O system with the best cost/performance given a limited budget. Performance is measured as MB/s using average disk access sizes. Ignore the cost of the CPU, and assume that the main memory and processor have enough bandwidth to keep up with whatever I/O system you design. Also, the operating system in the file server has no file cache because all the main memory is needed for the application. Here are the ground rules:

- The storage capacity must be at least 10 GB.
- Output to the disk cannot be buffered (the data must be written on the disk before the write is considered complete).
- To maintain a reasonable response time, no resource can be used at more than 80% of its rated maximum bandwidth.
- The maximum number of disks connected over a string to a disk controller is 10.
- The ratio of disk reads to disk writes is 2:1.
- The average size of a disk access is 4 KB.
- The seek distribution is like the UNIX timesharing workload in Figure 9.40 on page 559.
- Assume that every disk I/O requires an average rotational delay.

Here are the costs and performance of the components:

- Main memory: $1000/MB; 250 ns for 8 bytes.
- I/O bus + rack: $1000; holds up to 15 disk controllers.
- Disk controller + string: $3000; extra overhead is 1 ms per I/O event, and string bandwidth is 10 MB/s.
- 3.5-inch disk: $ , GB storage; rotates at 3600 RPM; seeks of 16 tracks or greater take 10 ms, seeks of 1 to 15 tracks take 5 ms, and there is no seek time for the same track; the maximum transfer rate is 2 MB/s.
- 2.5-inch disk: $ , GB storage; rotates at 5400 RPM; seeks of 16 tracks or greater take 8 ms, seeks of 1 to 15 tracks take 4 ms, and there is no seek time for the same track; the maximum transfer rate is 1.75 MB/s.
- 1.8-inch disk: $ , GB storage; rotates at 7200 RPM; seeks of 16 tracks or greater take 6 ms, seeks of 1 to 15 tracks take 3 ms, and there is no seek time for the same track; the maximum transfer rate is 1.50 MB/s.

Your maximum budget is $40,000.

A few comments on possible answers:

- Memory was useful only if you wanted to add a disk cache.
- The read-to-write ratio was only important if you were using a cache.
- The 80% bandwidth limit primarily applied to the string limit, but it was hard to get 10 disks and a file cache to transfer as much as 8 MB/s.
- Many people limited the disks to 80% of their maximum I/Os per second, which is a reasonable idea.

Quiz #2 S90 with Answers 10/30/94
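The per-I/O service time the question expects (seek + average rotational delay + transfer + controller overhead) can be sketched as below. The 5 ms average seek is an assumed placeholder for the Figure 9.40 distribution, which is not reproduced in this transcription; the other parameters come straight from the problem statement.

```python
# A rough service-time model for one disk I/O on the 3.5-inch disk:
# seek + average rotational delay + data transfer + controller overhead.

def io_service_ms(rpm, avg_seek_ms, transfer_mb_s,
                  access_kb=4, controller_ms=1.0):
    rotation_ms = 0.5 * 60_000 / rpm         # half a rotation on average, in ms
    transfer_ms = access_kb / transfer_mb_s  # KB / (MB/s) gives ms
    return avg_seek_ms + rotation_ms + transfer_ms + controller_ms

# 3.5-inch disk: 3600 RPM, 2 MB/s transfer, assumed 5 ms average seek
t = io_service_ms(3600, 5.0, 2.0)   # ~16.3 ms per 4 KB access
ios_per_sec = 1000 / t              # ~61 I/Os per second per disk
mb_per_sec = ios_per_sec * 4 / 1000 # ~0.25 MB/s per disk

# Ten such disks on one string stay well under the 80% string limit
# (0.8 * 10 MB/s = 8 MB/s of the controller's rated bandwidth).
string_ok = 10 * mb_per_sec <= 0.8 * 10
print(round(t, 2), round(ios_per_sec, 1), string_ok)
```

This is why the 80% rule rarely binds at the string: a handful of 4 KB accesses per disk per rotation cannot saturate a 10 MB/s string.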
(a) [1 point] List the cost of the components of your I/O system.

Item | Number | Cost
TOTAL

The basic disk design options were:
- inch disks + 1 controller + 1 bus = $24,000
- inch disks + 2 controllers + 1 bus = $29,000
- inch disks + 4 controllers + 1 bus = $37,000

The remaining money can be used as a disk cache to help reads.

(b) [4 points] Draw a diagram of your I/O design.

(c) [3 points] Calculate the performance of your I/O system. List all assumptions. Common mistakes were:
- Not including 0-length seeks in the average seek time
- Not including rotational latency in the disk access time
- Not including block transfer time in the disk access time

(d) [1 point] What is the MB/s per $1000 of your system?

(e) [1 point] What are the strengths of your I/O system compared to other options?
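For part (d), the metric is a simple quotient. A sketch with hypothetical inputs: the $37,000 option listed above, and an assumed aggregate throughput of 3.5 MB/s (the real value comes from your part (c) calculation).

```python
# MB/s per $1000 of system cost, as asked in part (d).
# Both inputs here are illustrative, not the quiz's answer.
def mb_s_per_1000_dollars(throughput_mb_s, total_cost):
    return throughput_mb_s / (total_cost / 1000.0)

metric = mb_s_per_1000_dollars(3.5, 37_000)
print(round(metric, 3))   # ~0.095 MB/s per $1000
```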
2. [28 points] The cost/performance of three DLX processors is to be examined, taking two approaches to performance. Except where stated otherwise, assume the same hardware resources as in the book for each approach:

- Model A: The superscalar DLX issues up to one integer and one floating-point instruction each clock cycle.
- Model T: Similar to Model A, this superscalar DLX also issues up to one integer and one floating-point instruction each clock cycle, but includes hardware for the Tomasulo algorithm as described in Section 6.7.
- Model V: The vector version of DLX, called DLXV in Chapter 7. It includes support for chaining.

All three models have an instruction cache containing 4096 bytes using 16-byte blocks that is direct mapped. Models A and T also have a 4096-byte data cache with 128-byte blocks that is also direct mapped. This is a write-through cache. All models have the same hardware building blocks:

- A fully pipelined adder whose latency is 3 clock cycles
- A fully pipelined multiplier whose latency is 5 clock cycles
- A memory containing 16 banks, with each bank being 64 bits wide. The latency for a memory access is 10 clock cycles. The memory system delivers one 64-bit quantity per bank after the initial memory latency.

(a) [5 points] Here is the cost of each resource needed to construct the models (a microacre is an imaginary unit of silicon area):

- 1 bit of static RAM: microacres
- 1 bit of register: microacres
- 64-bit pipelined floating-point adder: 1 microacre
- 64-bit pipelined floating-point multiplier: 1 microacre
- Vector control unit (not including registers): 2 microacres
- Tomasulo control unit: 2 microacres
- Single Tomasulo reservation station: 0.1 microacres
- Integer control unit and ALU (not including registers): 1 microacre
- Cache controller (including address comparator and muxes but no RAM): 1 microacre

What is the cost of each model? Fill in the rest of the categories and sum the resources of each model. (Round to the nearest microacre.)
Common mistakes:
- Forgetting to include tag memory space (-2)
- Forgetting that the 4 KB cache is in bytes, not bits (-1)
- Miscalculating the size of the vector registers: 8*64*64 (-1)
- Forgetting the vector registers (-1)
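The per-bit areas were lost in this transcription, but the per-model totals in the table below are consistent with 0.001 microacre per SRAM bit and 0.002 per register bit; those inferred values are an assumption here. The tag-RAM sizing follows the formula the answer key uses:

```python
import math

# Tag RAM sizing: tag bits = 32 - lg(block size) - lg(number of blocks)
# + 1 valid bit, times the number of blocks.
# Per-bit areas below are inferred from the totals, not stated in the quiz.
SRAM_BIT, REG_BIT = 0.001, 0.002

def tag_ram_bits(cache_bytes, block_bytes, addr_bits=32):
    blocks = cache_bytes // block_bytes
    tag = addr_bits - int(math.log2(block_bytes)) - int(math.log2(blocks)) + 1
    return tag * blocks

icache_tag  = tag_ram_bits(4096, 16)    # 21 * 256 = 5376 bits -> ~5.4 microacres
icache_data = 4096 * 8                  # 32768 bits -> ~32.8 microacres
dcache_tag  = tag_ram_bits(4096, 128)   # 21 * 32 bits -> ~0.7 microacres (key lists 22*32)
vector_regs = 8 * 64 * 64               # 8 regs x 64 elems x 64 bits -> ~65.5 microacres

print(icache_tag * SRAM_BIT, icache_data * SRAM_BIT, vector_regs * REG_BIT)
```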
Model A
- Integer control unit and ALU: 1
- Integer regs (32*32*2): 2
- Scalar FP regs (16*64*2): 2
- FP adder: 1
- FP multiplier: 1
- Instruction cache control: 1
- Data cache control: 1
- (total so far): (9)
- Instruction cache tag RAM ((32 - lg(16) - lg(4k/16) + 1 valid) * (4k/16) * 1 = 21*256*1): 5.4
- Instruction cache data RAM (4k*8): 32.8
- Data cache tag RAM ((32 - lg(128) - lg(4k/128) + 1 valid) * (4k/128) * 1 = 22*32*1): 0.7
- Data cache data RAM (4k*8*1): 32.8
- TOTAL: 81

Model T
- Integer control unit and ALU: 1
- Integer regs (32*32*2): 2
- Scalar FP regs (16*64*2): 2
- FP adder: 1
- FP multiplier: 1
- Instruction cache control: 1
- Data cache control: 1
- Tomasulo control unit: 2
- (total so far): (11)
- Instruction cache tag RAM (21*256*1): 5.4
- Instruction cache data RAM (4k*8*1): 32.8
- Data cache tag RAM (22*32*1): 0.7
- Data cache data RAM (4k*8*1): 32.8
- 3 Tomasulo adder reservation stations: 0.3
- 2 Tomasulo multiplier reservation stations: 0.2
- Load buffer (6*64*2): 0.8
- Store buffer (3*(32+64)*2): 0.6
- TOTAL: 85

Model V
- Integer control unit and ALU: 1
- Integer regs (32*32*2): 2
- Scalar FP regs (16*64*2): 2
- FP adder: 1
- FP multiplier: 1
- Instruction cache control: 1
- Vector control unit: 2
- (total so far): (10)
- Instruction cache tag RAM (20*256*1): 5.4
- Instruction cache data RAM (4k*8): 32.8
- Vector registers (8*64*64*2): 65.5
- TOTAL: 114

Here is the kernel that will be used for the comparison. Q, X, and Z are double-precision floating-point numbers, with the dimensions of X and Z being 1024. Q is in Fq, K is in Rk, the base address of array Z is in Rz, and the base address of array X is in Rx.

C*** KERNEL 3 INNER PRODUCT
      Q = 0.0
      DO 3 K = 1, 1024
    3 Q = Q + Z(K)*X(K)

Each program should take no more than 20 lines. Given the time available, do not spend much time on optimization of the code. The loop at the bottom can be changed if necessary.
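A Python reference for the kernel may help in checking the assembly versions: first the scalar loop as written, then the 64-element partial-sum strategy the vector code relies on (keep 64 running sums in a vector register, reduce them at the end). The array contents are made up; the quiz never specifies them.

```python
# Scalar and partial-sum versions of the inner-product kernel.
N, VLEN = 1024, 64
Z = [0.5] * N
X = [2.0] * N

Q = 0.0
for k in range(N):              # DO 3 K = 1, 1024
    Q += Z[k] * X[k]            # Q = Q + Z(K)*X(K)

partial = [0.0] * VLEN          # plays the role of vector register V0
for k in range(0, N, VLEN):
    for i in range(VLEN):       # one chained multiply/add trip
        partial[i] += Z[k + i] * X[k + i]
Q_vec = sum(partial)            # the scalar reduction loop at the end

print(Q, Q_vec)   # both 1024.0 for this sample data
```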
(b) [3 points] Write an efficient DLX version of this program for Model A. (The end-of-loop code is written at the bottom just to give you as much space as you need.)

Label: Instruction (integer) / Instruction (FP) / Comment
 1.       ADDI Rk,R0,#1     MOVI2FP Fq,R0    K = 1 and Q = 0
 2.                         MOVD F10,Fq      F10 = 0 (to start loop)
 3.                         MOVD F12,Fq      F12 = 0 (to start loop)
 4. Loop: LD F0,(Rz)                         Load Z[k]
 5.       LD F2,(Rx)        ADDD Fq,Fq,F10   Load X[k] & Q = Q + Z*X[k+1]
 6.       LD F6,8(Rz)                        Load Z[k+1]
 7.       LD F8,8(Rx)       MULTD F4,F0,F2   Load X[k+1] & Z[k]*X[k]
 8.       ADDI Rz,Rz,#2*8
 9.       ADDI Rk,Rk,#2     MULTD F10,F6,F8  K = K+2 & Z[k+1]*X[k+1]
10.       SLE R1,Rk,#1024                    K <= 1024?
11.       BNE R1,Loop                        Go to loop if <=
12.       ADDI Rx,Rx,#2*8   ADDD F12,F12,F4  Q = Q + Z*X[k]; delay slot
13.                         ADDD Fq,Fq,F12   Combine partial sums; outside loop

(c) [2 points] What changes (if any) would you make to run this program on Model T? Why?

I wouldn't unroll the loop as above, since the Tomasulo-algorithm hardware gets the same effect by using buffers, renaming registers, and allowing out-of-order execution.

(d) [3 points] Write an efficient DLXV version of this program for Model V:

Label: Instruction / Comment
 1.       LV V0,(Rzero)      Set V0 = 0
 2.       ADDI Rk,R0,#1      K = 1 (overlap with LV)
 3.       MOVI2FP Fq,R0      Q = 0 (overlap with LV)
 4. Loop: LV V1,(Rz)         V1 = Z[k..k+63]
 5.       ADDI Rz,Rz,#64*8   Increment Z base during vector load
 6.       ADDI Rk,Rk,#64     Increment K by 64 (overlap with LV)
 7.       SLE R1,Rk,#1024    K <= 1024? (overlap with LV)
 8.       LV V2,(Rx)         V2 = X[k..k+63] (adjust for early increment)
 9.       MULTV V3,V1,V2     V3 = V1*V2 (chained)
10.       ADDV V0,V0,V3      V0 = V0 + V3 (chained); partial sums
11.       BNE R1,Loop        Go to loop if <=
12.       ADDI Rx,Rx,#64*8   Increment X base (delay slot)
13.       SV V0,(Rx)         Store V0 in memory so you can add it up
14.       ADDI R2,Rx,#64*8   Set ending address (overlap with SV)
15. Loop2: LD F2,(Rx)        Load element of V0
16.       SLT R1,Rx,R2       Rx < end? (load delay slot); tested early, so < vs. <=
17.       ADDD Fq,Fq,F2      Q = Q + V0[i]
18.       BNE R1,Loop2       Go to loop if <
19.       ADDI Rx,Rx,#8      Increment address (delay slot)

(e) [1 point] What is the total number of instruction cache misses for each version of the program? (Note: this question is not about the miss rate.) Then give the number of clock cycles for instruction cache misses. Explain your calculation.

Simply the number of unique instructions divided by 4 (16-byte block / 4 bytes per instruction), since all the code fits in the cache. (You must round up if the instructions partially fill one block.)
- A (unrolled) = 18 instructions => 5 misses
- T (rolled) = 12 instructions => 3 misses
- V = 19 instructions => 5 misses

The instruction cache miss penalty is the memory latency plus the block size in bytes divided by the number of bytes transferred per clock cycle:
= Memory latency + (block size / 8 - 1) = 10 + 16/8 - 1 = 11
(or one could say that an extra clock is needed for the first 8 bytes)

- Model A: 55 clock cycles
- Model T: 33 clock cycles
- Model V: 55 clock cycles

(f) [2 points] What is the total number of data misses for each version of the program? (Note: this question is not about the miss rate.) Then give the number of clock cycles for data cache misses. Explain your calculation.

Assuming the arrays are not aligned in memory so that they line up exactly on top of each other in the cache (e.g., Z[0] does not use the same cache block as X[0]), you have only compulsory misses:
=> 2*1024 references / (128 bytes in block / 8 bytes per element) = 2048 / 16 = 128 misses

The data cache miss penalty is the memory latency plus the block size in bytes divided by the number of bytes transferred per clock cycle:
= Memory latency + (block size / 8 - 1) = 10 + 128/8 - 1 = 25
(or one could say that an extra clock is needed for the first 8 bytes)

- Model A: 3200 clock cycles
- Model T: 3200 clock cycles
- Model V: 0 clock cycles on data cache misses, since there is no data cache!

(g) [10 points] Calculate the number of clock cycles it takes to execute the program for each model of DLX. Show your work.

Model A:
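The miss counts and penalty clocks from parts (e) and (f), which are reused in the per-model cycle counts below, can be checked with a short script:

```python
import math

# Instruction misses: unique instructions per 16-byte block (4 instructions
# per block), rounded up. Data misses: compulsory only, one per 128-byte
# block of each 8 KB array. Penalty = latency + block/8 - 1 clocks.
def icache_misses(n_instructions, block_bytes=16, instr_bytes=4):
    return math.ceil(n_instructions / (block_bytes // instr_bytes))

def miss_penalty(block_bytes, mem_latency=10, bytes_per_clock=8):
    return mem_latency + block_bytes // bytes_per_clock - 1

i_pen, d_pen = miss_penalty(16), miss_penalty(128)   # 11 and 25 clocks
data_misses = 2 * 1024 // (128 // 8)                 # 128 misses

print(icache_misses(18) * i_pen,   # Model A: 5 misses -> 55 clocks
      icache_misses(12) * i_pen,   # Model T: 3 misses -> 33 clocks
      icache_misses(19) * i_pen,   # Model V: 5 misses -> 55 clocks
      data_misses * d_pen)         # Models A and T: 128 * 25 = 3200 clocks
```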
The unrolled loop is scheduled so that there are no stalls due to structural hazards.

Time in loop ignoring cache misses: 3 + (1024/2)*9 + 3 = 4614
Total time including caches = 4614 + 55 + 3200 = 7869

Model T:

The rolled loop avoids stalls automatically, with scalar and floating point overlapped. Let's ignore the extra cycles until it reaches steady state, and assume all integer instructions are overlapped with floating point. The limit to the number of operations completed per loop is then the number of buffers and reservation stations:
- 6 load buffers means at most 6 operands can be waiting
- 3 add reservation stations means at most 3 adds can be pending
- 2 multiply reservation stations means at most 2 multiplies can be pending

Time in loop ignoring cache misses: 1024 * (floating-point latency / number of pending operations) + 1 = 1024 * (8/2) + 1 = 4097

Assuming that Model T stalls during a cache miss, total time = 4097 + 33 + 3200 = 7330

An alternate assumption is that the Tomasulo hardware does not stall during a cache miss, but continues. Then the question is what the delay is during a cache miss when operating at full efficiency. Assume that as soon as a load buffer is freed, the access is made to the element of Z that causes the miss. During the 25 clock cycles of the miss the pipeline will drain (since it takes only about 10 clocks to complete the remaining operations). The Tomasulo hardware will then establish all the dependencies for the remaining operations. Since the pipeline is drained, it will take just as long to re-establish the pipeline once the missing operands are fetched. Hence, there is at most a small performance advantage in allowing the cache to continue during a miss. (In fact, a miss on Z will be followed immediately by a miss on X, and you need an element from each to continue, so the time gap is really that of two cache misses, less whatever overlap is possible while waiting for both misses, less the time to refill the pipeline after the misses.)
To make an improvement you would need to assume an early-restart cache that did not wait until all of the data of a block arrived. (And since a miss on Z is followed by a miss on X, you would still have to wait the full miss penalty on Z plus the latency on X.)

Model V:

Instruction           Start       Complete
Prologue:  LV                     75
Loop1:     LV                     75
           LV         75+1 = 76   150
           MULTV      (chained)
           ADDV       (chained)   160
Epilogue:  SV                     75
Loop2:     LD takes 11 clock cycles (no cache!); the remaining 4 instructions take one clock each
Vector time = Prologue + Loop1 * (1024/64) + Epilogue + Loop2 * 64 = 75 + 160*16 + 75 + 15*64 = 3670 clock cycles

Give a short explanation why the winner is fastest:

Model V is fastest because:
(1) The calculation is quicker because the floating-point units are being run at the maximum pipeline rate a larger fraction of the time.
(2) Model V pays two memory latencies every 64 elements (vector load start-ups) vs. two memory latencies every 16 elements (determined by the block size before another cache miss).
(3) Model V begins execution as soon as the first word from the second vector is fetched (11 clocks), as opposed to waiting for the full cache block to be loaded (25 clocks).
(4) There is no reuse of the data, so the only benefit of the data cache is spatial locality, and the same effect is achieved by the vector load.
(5) While the code is not completely vectorizable, by doing the vector add to accumulate the partial sums in the main loop, most of the adds are performed at vector rates. (In fact, the vector operations are 64*16 = 1024 multiplies + 64*16 = 1024 adds vs. 64 scalar adds, giving a vectorization rate of 2048/2112 or 97%.)

(h) [2 points] Using the cost information from your answer to (a) and your performance calculation from your answer to (g), which model has the best cost/performance? Assuming all the cost is in the processor:

Model | Area | Area^3 (Cost, see p. 60) | Clocks | Fl. Pt. Ops per clock (Performance) | Perf/Cost (x 10^7) | Normalized to Model A
A     | 81   | 531,441                  | 7869   | 0.26                                | 4.9                | 100%
T     | 85   | 614,125                  | 7330   | 0.28                                | 4.6                | 93%
V     | 114  | 1,481,544                | 3670   | 0.58                                | 3.9                | 79%

Model A has the best cost/performance.

Also give a short explanation why you think the winner has the best cost/performance:

Since cost goes with the 3rd power of area, the smaller area of A offsets the slightly higher speed of T. And while Model V is more than twice as fast as Model A, its larger area results in a cost that is almost three times that of A, hence Model A still has the better cost/performance.
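The arithmetic behind the cost/performance comparison can be reproduced. Cost is taken as area cubed, performance as floating-point operations per clock; the operation counts (2048 FP ops for Models A and T, 2112 for Model V) follow point (5) above.

```python
# Cost/performance comparison from part (h).
models = {
    "A": {"area": 81,  "clocks": 7869, "fp_ops": 2048},
    "T": {"area": 85,  "clocks": 7330, "fp_ops": 2048},
    "V": {"area": 114, "clocks": 3670, "fp_ops": 2112},
}

ratio = {}
for name, m in models.items():
    cost = m["area"] ** 3              # cost grows with the cube of area
    perf = m["fp_ops"] / m["clocks"]   # FP operations per clock
    ratio[name] = perf / cost

# Normalize to Model A; A should come out on top.
norm = {name: r / ratio["A"] for name, r in ratio.items()}
best = max(ratio, key=ratio.get)
print(best, {n: round(v, 2) for n, v in norm.items()})
```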
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationCSE 490/590 Computer Architecture Homework 2
CSE 490/590 Computer Architecture Homework 2 1. Suppose that you have the following out-of-order datapath with 1-cycle ALU, 2-cycle Mem, 3-cycle Fadd, 5-cycle Fmul, no branch prediction, and in-order fetch
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2018 Static Instruction Scheduling 1 Techniques to reduce stalls CPI = Ideal CPI + Structural stalls per instruction + RAW stalls per instruction + WAR stalls per
More informationInstruction Level Parallelism
Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches
More informationCPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More information4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.
Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that
More informationUNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568
UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Computer Architecture ECE 568 Final Exam - Review Israel Koren ECE568 Final_Exam.1 1. A computer system contains an IOP which may
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationCS 614 COMPUTER ARCHITECTURE II FALL 2005
CS 614 COMPUTER ARCHITECTURE II FALL 2005 DUE : November 9, 2005 HOMEWORK III READ : - Portions of Chapters 5, 6, 7, 8, 9 and 14 of the Sima book and - Portions of Chapters 3, 4, Appendix A and Appendix
More informationLecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest
More informationCPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction
More informationMetodologie di Progettazione Hardware-Software
Metodologie di Progettazione Hardware-Software Advanced Pipelining and Instruction-Level Paralelism Metodologie di Progettazione Hardware/Software LS Ing. Informatica 1 ILP Instruction-level Parallelism
More informationWebsite for Students VTU NOTES QUESTION PAPERS NEWS RESULTS
Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationCS 2410 Mid term (fall 2018)
CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data
More informationCS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example
CS252 Graduate Computer Architecture Lecture 6 Tomasulo, Implicit Register Renaming, Loop-Level Parallelism Extraction Explicit Register Renaming John Kubiatowicz Electrical Engineering and Computer Sciences
More informationELE 375 Final Exam Fall, 2000 Prof. Martonosi
ELE 375 Final Exam Fall, 2000 Prof. Martonosi Question Score 1 /10 2 /20 3 /15 4 /15 5 /10 6 /20 7 /20 8 /25 9 /30 10 /30 11 /30 12 /15 13 /10 Total / 250 Please write your answers clearly in the space
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationLecture 16. Today: Start looking into memory hierarchy Cache$! Yay!
Lecture 16 Today: Start looking into memory hierarchy Cache$! Yay! Note: There are no slides labeled Lecture 15. Nothing omitted, just that the numbering got out of sequence somewhere along the way. 1
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of
More informationSuper Scalar. Kalyan Basu March 21,
Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build
More informationCS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes
CS433 Midterm Prof Josep Torrellas October 19, 2017 Time: 1 hour + 15 minutes Name: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your time.
More informationPage 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationComputer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović
Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013
ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of
More informationComplex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units
6823, L14--1 Complex Pipelining: Out-of-order Execution & Register Renaming Laboratory for Computer Science MIT http://wwwcsglcsmitedu/6823 Multiple Function Units 6823, L14--2 ALU Mem IF ID Issue WB Fadd
More informationEECS 470 Midterm Exam
EECS 470 Midterm Exam Winter 2014 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: NOTES: # Points Page 2 /12 Page 3
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationCISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1
CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationCache introduction. April 16, Howard Huang 1
Cache introduction We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? The rest of CS232 focuses on memory and input/output issues, which are frequently
More informationLecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S
Lecture 6 MIPS R4000 and Instruction Level Parallelism Computer Architectures 521480S Case Study: MIPS R4000 (200 MHz, 64-bit instructions, MIPS-3 instruction set) 8 Stage Pipeline: first half of fetching
More informationCS161 Design and Architecture of Computer Systems. Cache $$$$$
CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks
More information