Pentium IV-XEON. Computer architectures M


1 Pentium IV-XEON Computer architectures M 1

2 Pentium IV block scheme (diagram): 32 bytes transferred in parallel; four access ports to the EUs 2

3 Pentium IV block scheme (Netburst architecture). Units in the diagram: BTB (Branch Target Buffer), I-TLB (Instruction TLB), D-TLB (Data TLB), 3.2 GB/s bus interface, CISC Fetch/Decode, Address Generation Unit, µ-code ROM, Trace Cache, Rename/Allocation, µ-ops Queues, Schedulers, L2 Cache and Control, FP and Integer Register Files, FMul, FAdd, MMX, FP move, FP store, ALUs, Load, Store, L1 D-Cache and D-TLB. There is no instruction cache! µ-ops are sent directly to the FU reservation stations and at the same time inserted into the ROB 3

4 General features: Very high processor clock frequency (up to 4 GHz). Increased number of pipeline stages. No L1 instruction cache. For the data there is still a first-level cache (8 KB, 4-way set associative, 64-byte lines, half the line width of the second level). Second-level cache: unified, 256 KB, 8-way associative, 256-bit parallel interface (32 bytes in parallel), 128-byte lines, one transfer on each clock edge (32 bytes, ¼ of the line); bandwidth = 256 bit × 2 × fclock, i.e. the transfer rate is effectively double the CPU clock frequency. FSB 1.6 GHz. ROB: 126 µ-ops. BTB: 4K entries. But: the increased number of stages increases the branch penalty, and the increased clock frequency sometimes requires multiple cycles for the ALU operations, delaying the waiting instructions 4
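
To make the bandwidth formula above concrete, here is a small back-of-the-envelope sketch in Python; the 1.4 GHz core clock is taken from the technological-characteristics slide further on and is only an example value.

```python
# Sketch: L2 bandwidth as quoted above (256-bit bus, one 32-byte transfer
# on each clock edge). The 1.4 GHz clock is an illustrative value.

bus_width_bits = 256          # 32 bytes in parallel
transfers_per_clock = 2       # rising and falling edge
f_clock_hz = 1.4e9            # example core frequency

bandwidth_bytes_per_s = (bus_width_bits // 8) * transfers_per_clock * f_clock_hz
print(f"L2 bandwidth ≈ {bandwidth_bytes_per_s / 1e9:.1f} GB/s")  # ≈ 89.6 GB/s at 1.4 GHz
```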

5 Trace Cache The trace cache has two modes of operation. In «execute mode» it provides the pipeline stages with the µ-ops. In case of a miss, it retrieves the needed instructions from L2, interprets the branches (through the BTB), retrieves the instructions speculatively, decodes them and builds a «segment» of µ-ops inside. Two advantages: no bubble for the branches already predicted, and no waste of cache lines (in normal caches the bytes following a branch are not used because the fetch stops when a branch is encountered, while here a line can contain both the branch and the speculated code). Obviously there is always the possibility of mispredictions. Since in the majority of cases the instructions are already decoded, in the Pentium IV there are not three decoders but only one, because theoretically the need for decoding is reduced. In the trace cache there are in general 6 µ-ops per line. This architecture is a member of the architectures with «predecoding», where the I-cache holds decoded instructions (i.e. AMD Kx, Nehalem etc.). For complex instructions (>4 µ-ops), instead of filling the trace cache, a tag is inserted which triggers the decoding ROM, which provides the µ-ops whenever such instructions are encountered. Not much change for the efficiency (1 clock delay): the pipeline is in any case activated. Trace cache size: about 12K µ-ops (equivalent, for Intel, to a 16-18 KB instruction cache). BUT the trace cache is fragile: in case of a miss the decoding is made one instruction at a time (a single decoder!), and it was verified that the hit rate was less than 60%, so in 40% of the cases the system behaves as if only one instruction at a time were processed. The trace cache was therefore later discarded. NB: there is obviously a second BTB for the trace cache 5
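
A minimal sketch of the two operating modes described above; the data structures and the callbacks (decode_one, predict_taken) are hypothetical, not Intel's implementation.

```python
# Execute mode: a hit streams an already-decoded µ-op trace to the pipeline.
# Build mode: on a miss, the single decoder builds a new trace one x86
# instruction at a time, following the predicted branch path.

from collections import OrderedDict

class TraceCache:
    def __init__(self, capacity_uops=12 * 1024):      # ~12K µ-ops, as on the slide
        self.capacity = capacity_uops
        self.traces = OrderedDict()                    # trace start address -> list of µ-ops

    def fetch(self, ip, decode_one, predict_taken, trace_len=6):
        if ip in self.traces:                          # execute mode: no decode bubble
            self.traces.move_to_end(ip)                # LRU update
            return self.traces[ip]
        trace, cur = [], ip                            # build mode: serial decoding
        while len(trace) < trace_len:
            uop, next_ip, taken_target = decode_one(cur)
            trace.append(uop)
            cur = taken_target if (taken_target and predict_taken(cur)) else next_ip
        while sum(len(t) for t in self.traces.values()) + len(trace) > self.capacity:
            self.traces.popitem(last=False)            # evict least recently used trace
        self.traces[ip] = trace
        return trace

# toy usage: every "instruction" decodes to one µ-op and falls through
tc = TraceCache()
trace = tc.fetch(0x400000,
                 decode_one=lambda ip: (f"uop@{ip:#x}", ip + 4, None),
                 predict_taken=lambda ip: False)
print(trace)   # built on the first (miss) access, streamed on later hits
```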

6 It was said that Intel increased the clock frequency (and therefore the number of pipeline stages) only for marketing reasons. By reducing the transistor size the clock speed can be increased, but the functionality too could be improved: this was not the case with the PIV 6

7 Technological characteristics Technology 0.13 micron (130 nm), 217 mm2, 42 million transistors, 423-pin package (Socket 423). Power consumption 52 W at 1.4 GHz, much greater at 4 GHz! Great cooling problem 7

8 Technological characteristics 8

9 Technological characteristics Socket 423 9

10 Execution core (diagram): simple execution unit, complex execution unit, RAT, ports 2 and 3, arithmetic section, memory section (load and store). (It is always possible to imagine that the instructions are sent directly to the ports with a reference number and, after execution, to the ROB.) The dispatch ports (electrical paths to the FUs) are only 4 (0-3) 10

11 Execution core Because of the increased clock frequency the RS should have been enlarged to avoid underrun: Intel preferred to add the trace cache. Here too 3 µ-ops per clock are extracted. ROB size much greater (126 slots). Only 4 dispatch ports (ports 0-3, against the 5 ports of the P6). Up to 6 µ-ops per clock can be sent to the 4 ports (possible because the fast ALUs run at twice the clock speed, using positive and negative edges: 2 fast ALUs × 2 per clock + 1 slow ALU + 1 LD/ST). Execution port 0: integer additions and subtractions, logical operations, branch evaluation and store of data into the ROB (positive and negative edges); floating-point and SSE moves. Execution port 1: fast integer ALU (only additions and subtractions); slow ALU (complex integer operations, i.e. shifts and rotations, which cannot be executed in a single clock); FP and SSE. Execution ports 2 and 3: load and store. The PIV has 2 (fast) × 2 (µ-ops/clock) + 1 (slow) ALU = 5 virtual ALUs! 11
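
As a sketch of the dispatch arithmetic above (4 ports, double-pumped fast ALUs on ports 0 and 1, hence up to 6 µ-ops per clock); the port-to-slot mapping below is a deliberate simplification for illustration, not Intel's exact port assignment.

```python
# Up to 2x2 (double-pumped fast ALUs) + 1 (slow ALU) + 1 (load/store) = 6
# µ-ops can start in one clock; a seventh waits for the next clock.

PORT_SLOTS_PER_CLOCK = {
    0: 2,   # fast ALU, double pumped (plus FP/SSE move, store data)
    1: 2,   # fast ALU, double pumped (plus slow ALU, FP/SSE execute)
    2: 1,   # load
    3: 1,   # store
}

def dispatch(ready_uops):
    """ready_uops: list of (name, port). Returns the µ-ops issued this clock."""
    free = dict(PORT_SLOTS_PER_CLOCK)
    issued = []
    for name, port in ready_uops:            # oldest first, as in a FIFO scheduler queue
        if free.get(port, 0) > 0:
            free[port] -= 1
            issued.append(name)
    return issued

print(dispatch([("add", 0), ("sub", 0), ("shl", 1), ("xor", 1),
                ("ld", 2), ("st", 3), ("add2", 0)]))
# -> six µ-ops issued; 'add2' has to wait for the next clock
```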

12 Branch Prediction As in the P6 there are two branch predictors: one dynamic and one static. The BTB consists of 4096 entries (8-way set associative) with an algorithm not disclosed by Intel. In any case it is not enough if there are many nested loops. The static predictor is very simple: conditional forward branch not taken, conditional backward branch taken 12
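
The static rule fits in a couple of lines; this is just an illustration of the "backward taken, forward not taken" heuristic stated above.

```python
# Static prediction, used when the dynamic predictor has no information:
# backward conditional branches (loop back edges) are predicted taken,
# forward conditional branches are predicted not taken.

def static_predict(branch_addr: int, target_addr: int) -> bool:
    """True = predict taken."""
    return target_addr < branch_addr        # backward branch -> taken

assert static_predict(0x1000, 0x0FF0) is True    # loop back edge: taken
assert static_predict(0x1000, 0x1040) is False   # forward skip: not taken
```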

13 Pentium IV Pipeline TC Nxt IP Trace cache next instruction pointer (and branch verification). TC Fetch Trace cache fetch. Drive Driving stage (needed only for electrical reasons, to allow signal propagation at the high fclock). Many stages require two clocks because of the high clock frequency (for instance the transfer from the EU to the ROB). Stages 1-5. At the end of the first 5 stages the PIV inserts three µ-ops into a µ-ops queue (in order) for speed compensation 13

14 Pentium IV Pipeline Allocator/Register renaming Internal register allocation (128 integer and 128 FP, dynamically allocated) and ROB insertion. If LOAD or STORE => buffer register allocation and reservation for the FU. RAT: 128 integer registers, 128 FP, and 128 VPR (vector proc/FP). µ-ops queues and ROB (126 slots); the queues are two: one for LOAD/STORE and one for all the others. Scheduler µ-ops scheduling (µ-ops are executed only when the operands are ready). Each scheduler (10-12 µ-ops) selects the µ-ops to send to its ports (in order). Memory Scheduler - Load/Store Unit (LSU). Fast ALU Scheduler - Arithmetic-Logic Unit (simple integer and logical operations). Slow ALU/General FPU Scheduler - other ALU functions and the majority of the floating-point operations. Simple FP Scheduler - simple FP operations and FP with memory. 128 integer registers, 128 floating-point registers. Port: bus toward the Functional Units. Stages
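
A minimal register-renaming sketch in the spirit of the allocator/RAT stage described above; the 8 architectural registers and the 128 physical integer registers follow the slide, while the data structures and the rename() interface are purely illustrative.

```python
# Each architectural register is mapped onto one of 128 physical registers;
# a fresh physical register is allocated for every µ-op that writes a result,
# so later readers see the new mapping (WAR/WAW hazards disappear).

ARCH_REGS = ["EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "ESP", "EBP"]
free_list = list(range(128))                       # 128 physical integer registers
rat = {r: free_list.pop(0) for r in ARCH_REGS}     # architectural -> physical map

def rename(dest, sources):
    """Rename one µ-op: read sources through the RAT, allocate a new dest."""
    src_phys = [rat[s] for s in sources]           # only true dependences remain
    new_phys = free_list.pop(0)                    # in reality the allocator stalls if none is free
    rat[dest] = new_phys
    return new_phys, src_phys

# add EAX, EBX  followed by  sub ECX, EAX : the sub reads the *renamed* EAX
print(rename("EAX", ["EAX", "EBX"]))
print(rename("ECX", ["ECX", "EAX"]))
```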

15 Pentium IV RAT (diagram): the RAT maps the architectural registers EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP onto the physical registers 15

16 Pipeline Pentium IV Dispatcher (13-14), two stages: the (ready) µ-ops are sent to one of the 4 ports; (2 × 2) + 1 + 1 = 6, since the execution units use positive and negative clock edges (SSE = Streaming SIMD Extension). RF (15-16) Register file: the required data are retrieved from the sources and loaded into the EU registers 16

17 Pipeline Pentium IV Ex Execute (17): flags, execution result. BrChk Branch Check: when mispredicted, 18 stages back! Drive: rewrite and drive! Stages 17-20. The 3.5-4 GHz Pentium IV has a 31-stage pipeline!! 17
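
A rough estimate of what the "18 stages back" figure costs in time; the 2 GHz clock is only an example value, not a figure from the slide.

```python
# Back-of-the-envelope misprediction penalty: the branch check is reached
# ~18 stages after the trace-cache fetch, so a wrong prediction throws away
# roughly that many clocks of front-end work.

restart_stages = 18
f_clock_hz = 2.0e9                       # example clock frequency
penalty_ns = restart_stages / f_clock_hz * 1e9
print(f"~{restart_stages} clocks  =  ~{penalty_ns:.0f} ns lost per mispredicted branch")
```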

18 Instruction sequence The instructions are retrieved from the L2 (128-byte line = 2 × 64 bytes; 64 bytes = 2 × 256 bits transferred in a single clock period because the bus uses both clock edges). The memory address comes either from the BTB (branch) or from the I-TLB (4096 entries, 4-way associative). µ-ops are 118 bits long in the trace cache. (Block scheme as in slide 3, repeated on the following slides: 3.2 GB/s interface, BTB and I-TLB, CISC fetch/decode, µ-code ROM, trace cache, rename/allocation, µ-ops queues, schedulers, L2 cache and control, register files, FUs, L1 D-cache and D-TLB) 18

19 Instruction sequence BTB: 512 lines, 8-way set associative, 4-bit Yeh prediction. Static prediction: taken if backward, untaken if forward 19

20 Instruction sequence (block scheme) 20

21 Instruction sequence Allocation: 3 µ-ops per clock; register allocation, integer or FP (2 × 128) 21

22 Instruction sequence Two schedulers, double FIFO queue: in order within a queue, OOO between the two queues 22

23 Instruction sequence µ-ops go to the FUs when they are available; up to 6 µ-ops per clock; if addressed to the same FU, FIFO order. The schedulers handle integer and FP µ-ops, mispredicted branches and MMX 23

24 Instruction sequence Access to the integer and FP register files and insertion into the ROB; access to the data cache 24

25 Instruction sequence D-TLB: 64 entries, fully associative. L1 data cache: 8 KB, 4-way associative, 64-byte lines, write-through 25

26 Instruction sequence Branch check: a stage for branch verification and match with the prediction; update of the trace cache BTB and of the CISC (front-end) BTB predictions 26

27 Complexity vs efficiency More than proportional increase of complexity for each efficiency increase (superscalar execution, branch prediction, OoO execution, caches, clock frequency increase); cost and power consumption increase. (Normalized against the 486, performance = 1): a 5-fold increase of efficiency required a 15-fold increase of complexity and an 18-fold increase of power consumption 27

28 Multiprocessor But the number of transistors keeps increasing. Modern processors execute threads (which belong either to the same program, to different programs, or to the OS). At the pipeline and FU level there can be inefficiencies (for instance a program or a thread doesn't use a FU or a pipeline stage in a clock period) 28

29 Hyperthreading LP => Logical Processor (two LPs on one Pentium IV, both interfacing with the bus). Threading is the ability of executing at the same time multiple instruction sequences (see Java) which belong either to the same program or to different ones (similar to a multiprocessor). In the previous processors this required multiple concurrent processors. Starting from the Pentium IV the architecture allows the definition of two «logical» processors, each one having its own register set and able to interface with the bus. Each logical processor can execute its own instruction stream, be interrupted, stop (HALT) etc. Each thread is executed OOO 29

30 Hyperthreading Obviously multiple threads can be executed on conventional processors too, with context switches and time-slice policies. Hyperthreading, on the contrary, allows the execution of two threads without context switches 30

31 Hyperthreading Implications The executed processes behave as if they were executed on different processors. The processor must maintain a copy of the architectural state of each logical processor, while the logical processors use the same hardware resources (in practice instructions and operations must be «tagged», that is they carry a tag indicating the logical processor they belong to). The architectural state consists of all the general registers and of the machine registers. The two logical processors share the caches (physically addressed), the BTB (like the caches), the FUs and the control circuits (NOT the TLB). Improperly stated, hyperthreading can be thought of as pipeline-level multiprogramming. The die complexity increase is about 5% 31
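
A toy sketch of the tagging idea described above: separate architectural state per logical processor, shared structures (here a single ROB) whose entries are tagged with the logical-processor id. The data structures are illustrative, not Intel's microarchitecture.

```python
from dataclasses import dataclass, field

@dataclass
class LogicalProcessor:
    lp_id: int
    regs: dict = field(default_factory=lambda: {r: 0 for r in
        ["EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "ESP", "EBP", "EIP", "EFLAGS"]})

shared_rob = []                    # one physical ROB, entries tagged by lp_id

def allocate(lp: LogicalProcessor, uop: str):
    shared_rob.append({"lp": lp.lp_id, "uop": uop})

def retire(lp_id: int):
    """Retire only the entries belonging to the given logical processor."""
    global shared_rob
    mine = [e for e in shared_rob if e["lp"] == lp_id]
    shared_rob = [e for e in shared_rob if e["lp"] != lp_id]
    return mine

lp0, lp1 = LogicalProcessor(0), LogicalProcessor(1)
allocate(lp0, "add"); allocate(lp1, "load"); allocate(lp0, "store")
print(retire(0))                   # only LP0's µ-ops are retired here
```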

32 Performance A 20% efficiency increase against a 5% complexity increase. In general the efficiency increase is between 16% and 28% 32

33 Xeon (diagram: classical pipeline vs XEON pipeline, trace cache, thread 1 and thread 2) Design criteria: strong limitation of the die size increase (less than 5%). The stall of a «virtual» processor (cache misses, branch prediction errors, hazards etc.) can't block the other processor: there are queues between the stages and no virtual processor can use 100% of the resources. The efficiency of the processor when only one thread is executed must be the same as that of the processor without hyperthreading; this implies that the resources must be recombined 34

34 Xeon Two distinct sets of registers: GPRs, segment, flag, EIP, x87, MMX, SSE, TR, machine registers etc. The trace cache is equally subdivided per thread (the µ-ops are «tagged») and is common to the two threads (physically addressed). (In the diagram the yellow and red colours refer to the two threads: different instruction pointers, partitioned queues, common stages, µ-code ROM, path to the ROB allocator.) Distinct ITLBs (two different paging systems for different processes). The decoding is not alternated per clock but per instruction group 35

35 Xeon Execution Trace cache: in case of overlapping access requests, arbitration and alternate assignment per clock; if a thread is stalled the other executes regularly. The TC contains 12K µ-ops and is 8-way set associative; all entries are tagged by thread (replacement per thread); precise LRU replacement (shift register). Caches: L1 (data), L2 and L3 (if any) are unified (physical addresses). L1 (data cache) is 4-way associative, very fast, with 64-byte lines, and is write-through to L2. There is a common DTLB for L1, 4-way, with tagged entries; pages are 4 KB or 4 MB. L2 and L3 are 8-way associative with 128-byte lines. Arbitration for L2 access is FCFS, with a request queue where at least one slot per thread is reserved. The L2 output queues are separate per thread, each with 64-byte slots. BTB partially shared (physical addresses): the branch history buffers are separate, the PHT is unique with tagged entries, the RSBs are obviously separate (12 slots per thread). Unified caches can trigger conflicts but also benefits (i.e. the prefetched instructions of one thread could be used by the other, or a thread can use data produced by the other). RAT registers are obviously replicated 36

36 Xeon Decode The decode logic alternates between the queues of the threads and maintains two copies of the necessary information. As in the Pentium IV there is a microcode ROM for complex instructions. Allocator (out-of-order pipeline, fed from the trace cache, split clock access): it assigns the µ-ops to the 128 slots of the reorder buffer (tagged for in-order retirement), to the 128 integer and 128 floating-point registers, and to the load (48) and store (24) buffers. Buffers are partitioned so that each thread can use only 50%. If both threads have µ-ops available, the service alternates each clock. If a thread has reached its limit it is stalled: the 50% is never violated. By so doing fairness is guaranteed and deadlocks are avoided 37
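
A small sketch of the 50% partitioning policy described above, using the buffer sizes quoted on the slide (ROB 128, load buffer 48, store buffer 24); the allocation interface is hypothetical.

```python
# No thread may ever hold more than 50% of a partitioned resource: the
# requesting thread simply stalls, which keeps the two threads fair and
# prevents one thread from starving (or deadlocking) the other.

LIMITS = {"rob": 128, "load_buf": 48, "store_buf": 24}
in_use = {t: {r: 0 for r in LIMITS} for t in (0, 1)}   # per-thread occupancy

def try_allocate(thread: int, resource: str, n: int = 1) -> bool:
    """Grant the request only if the thread stays within its 50% share."""
    if in_use[thread][resource] + n > LIMITS[resource] // 2:
        return False                                    # thread stalls, 50% never violated
    in_use[thread][resource] += n
    return True

print(try_allocate(0, "rob", 64))    # True : thread 0 reaches its 64-slot share
print(try_allocate(0, "rob", 1))     # False: thread 0 is stalled at the limit
print(try_allocate(1, "rob", 64))    # True : thread 1 still has its own half
```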

37 Xeon Register aliasing: it assigns to each of the 8 architectural registers of a thread up to 16 physical registers (16 × 8 = 128); it operates in parallel with the allocator. After the renaming/allocation stage there are two distinct µ-op queues, one of them for memory operations. Scheduling: there are n schedulers, one for each functional unit. Up to 6 µ-ops can be issued to the FUs per clock. Each µ-op is deemed ready for execution according to the availability of the operands and of the functional unit. Each scheduler has a 12-entry queue from which it extracts the µ-ops to execute; in the same clock period µ-ops of the two threads can be extracted for execution; a thread cannot use all the queue slots of a functional unit. Execution units: the FUs are not concerned with the origin of the µ-ops they execute; data are retrieved from the «alias» registers of the ROB and the results are written into the ROB. After the execution the state of the µ-ops is RR 38

38 Xeon µ-ops commitment The µ-ops commitment (that is, the writing of data into the architectural registers, or into L1) is executed alternately, each clock, for the two threads. If there are no µ-ops of one thread to be retired, the commitment of the other thread is not interrupted. ST/MT mode: the system can handle single threads too, where all the resources are attributed to a single thread according to the programming (A only, A and B, B only) 39
