Computer Design Concept
Tadao Nakamura
akamura@archi.is.tohoku.ac.jp akamura@umunhum.stanford.edu

Pascal's Calculator

Leibniz's Calculator

Babbage's Calculator

Von Neumann Computer

Flynn's Classification of Computer Architecture

Microprocessor Design Process

[Figure: design process diagram. Labels: Requirement Specification; Clarification of the Requirement Specification; Conceptual Design; Design Concept; Upgrade and Improve; Information to Adapt the Specification Requirement; Design Stems of Mechanical Engineering.]

Moore's Law
[Figure: number of transistors (thousands, log scale from 1 to 1,000,000) versus time (1971 to 2001) for the 4004, 8085, 8086, i386, Pentium, Pentium Pro, Pentium II, and Pentium III; annotation: 4.3 billion transistors in 2014.]
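Moore's law is an exponential growth claim, so two data points are enough to estimate the implied doubling time. The sketch below assumes the commonly cited count of 2,300 transistors for the 4004 in 1971 (not given on the slide) and uses the slide's 4.3 billion transistors in 2014; the function name is hypothetical.

```python
import math


def doubling_time_years(count0, year0, count1, year1):
    """Years per doubling implied by exponential growth between two data points."""
    doublings = math.log2(count1 / count0)
    return (year1 - year0) / doublings


# 2,300 transistors for the 4004 (1971) is an assumed, commonly cited figure;
# 4.3 billion transistors (2014) is the value annotated on the slide.
t = doubling_time_years(2_300, 1971, 4.3e9, 2014)
print(f"{t:.2f} years per doubling")  # roughly two years
```

About 21 doublings over 43 years gives a doubling roughly every two years.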

Structure of Intel's Microprocessor 8085
[Figure: block diagram with an 8-bit internal data bus connecting accumulator A, status register SR, ALU, registers B, C, D, E, H, L, stack pointer SP, program counter PC, instruction register IR, and data register DR; clock generator and control circuits with internal control lines and external control; serial I/O port; address buffer driving AD0-AD7 (address/data) and AD8-AD15 (address).]

Simple Model of von Neumann Computers

Distribution of Processors in Cycles per Instruction
[Figure: cycles per instruction (CPI, 0.1 to 20, log scale) versus clock frequency (5 to 1000 MHz, log scale), with regions for scalar CISC, scalar RISC, superscalar RISC, VLIW, and superpipeline processors.]
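The CPI-versus-frequency plane matters because execution time equals instruction count times CPI divided by clock frequency. A design can therefore gain by lowering CPI (superscalar, VLIW) or by raising frequency (superpipelining). The sketch below evaluates that relation. The two design points are hypothetical values chosen for illustration, not numbers read from the figure.

```python
def cpu_time_seconds(instruction_count, cpi, clock_mhz):
    """Execution time = instruction count x CPI / clock frequency."""
    return instruction_count * cpi / (clock_mhz * 1e6)


def native_mips(cpi, clock_mhz):
    """Millions of instructions per second = frequency in MHz / CPI."""
    return clock_mhz / cpi


# Hypothetical design points (not taken from the figure)
for name, cpi, mhz in [("scalar CISC", 5.0, 20), ("superscalar RISC", 0.5, 200)]:
    print(name, native_mips(cpi, mhz), "MIPS,",
          cpu_time_seconds(1_000_000, cpi, mhz), "s for 1M instructions")
```

Moving down the CPI axis and right on the frequency axis multiplies together, which is why the two design points differ by a factor of 100.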

Pipeline Execution in a Superscalar Processor
[Figure: instructions 1-9 versus clock cycles 1-6; in a superscalar processor several instructions enter each pipeline stage in the same clock cycle. Stages: IF (instruction fetch), DE (decode & operand fetch), EX (execution), WB (write back).]

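Pipeline Execution in a VLIW Processor
[Figure: instructions 1-9 versus clock cycles 1-6; in a VLIW processor several operations are packed into one long instruction and move through the pipeline together. Stages: IF (instruction fetch), DE (decode & operand fetch), EX (execution), WB (write back).]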

Pipeline Execution in a Superpipeline Processor
[Figure: instructions 1-6 versus clock cycles 1-6; in a superpipeline processor a new instruction enters the pipeline every fraction of a clock cycle. Stages: IF (instruction fetch), DE (decode & operand fetch), EX (execution), WB (write back).]
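The three pipeline diagrams can be compared with the standard ideal-pipeline timing expressions. These assume no hazards and count time in base clock cycles. The first instructions take k cycles to fill a k-stage pipeline. After that, a degree-m superscalar (or a VLIW machine with m operations per long instruction) finishes m instructions per cycle, and a degree-d superpipeline finishes one instruction every 1/d cycle. The function names below are hypothetical.

```python
import math


def scalar_cycles(n, k):
    """Base k-stage pipeline issuing one instruction per cycle."""
    return k + n - 1


def superscalar_cycles(n, k, m):
    """Degree-m superscalar (or VLIW with m operations per long instruction)."""
    return k + math.ceil(n / m) - 1


def superpipeline_cycles(n, k, d):
    """Degree-d superpipeline: a new instruction every 1/d of a base cycle."""
    return k + (n - 1) / d


print(scalar_cycles(9, 4))             # 12
print(superscalar_cycles(9, 4, 3))     # 6
print(superpipeline_cycles(6, 4, 2))   # 6.5
```

With four stages and a degree of 3, nine instructions finish in six base cycles. This is consistent with the axis ranges of the superscalar diagram.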

Parallel Computers
[Figure: (a) multiprocessor system: several CPUs connected through a switch to a shared main memory; (b) multicomputer system: several CPUs, each with its own main memory, connected through a switch.]
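The two organizations differ in how processors communicate. In a multiprocessor they share one memory and must synchronize access to it. In a multicomputer each processor owns its memory and communicates only by sending messages through the switch. The sketch below models both styles with Python threads. It is only a programming-model illustration (Python threads do not run bytecode in parallel), and the function names and workload are hypothetical.

```python
import queue
import threading


def shared_memory_sum(data, n_cpus):
    """Multiprocessor style: every CPU updates one location in shared memory."""
    total = [0]                  # the shared main-memory location
    lock = threading.Lock()      # serializes access to the shared location

    def cpu(chunk):
        partial = sum(chunk)
        with lock:
            total[0] += partial

    threads = [threading.Thread(target=cpu, args=(data[i::n_cpus],))
               for i in range(n_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def message_passing_sum(data, n_cpus):
    """Multicomputer style: each computer owns a local memory and sends a message."""
    mailbox = queue.Queue()      # stands in for the interconnection switch
    local_memories = [data[i::n_cpus] for i in range(n_cpus)]

    def computer(local_memory):
        mailbox.put(sum(local_memory))   # the only communication is this message

    threads = [threading.Thread(target=computer, args=(mem,))
               for mem in local_memories]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in range(n_cpus))


data = list(range(1, 101))
print(shared_memory_sum(data, 4), message_passing_sum(data, 4))  # 5050 5050
```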

Floating Point Arithmetic Processing
[Figure: (a) floating point arithmetic pipeline computing $Z = X + Y$ for $X = x_1 E^{+x_2}$ and $Y = y_1 E^{-y_2}$ through the stages compare exponents, shift, add, normalize; (b) pipeline processing on one processor: operand pairs (X1, Y1), (X2, Y2), (X3, Y3) produce Z1, Z2, Z3 at 1 result per clock; (c) array processing on 3 processors: 3 results in 4 clocks.]
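The four pipeline stages of floating point addition can be written out directly on (mantissa, exponent) pairs. This is a minimal sketch under stated assumptions: radix E = 10, non-negative 4-digit integer mantissas, and truncation when aligning. `BASE`, `DIGITS`, `fp_add`, and `pipeline_clocks` are hypothetical names, and real hardware uses a binary radix, rounding, and signed mantissas.

```python
BASE = 10     # assumed radix E
DIGITS = 4    # assumed mantissa width in decimal digits


def fp_add(x, y):
    """Add (mantissa, exponent) pairs, value = mantissa * BASE**exponent."""
    (mx, ex), (my, ey) = x, y
    # Stage 1: compare exponents (keep the operand with the larger exponent in x)
    if ex < ey:
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    diff = ex - ey
    # Stage 2: shift the smaller operand's mantissa right to align the exponents
    my = my // BASE ** diff          # shifted-out digits are truncated
    # Stage 3: add the aligned mantissas
    m, e = mx + my, ex
    # Stage 4: normalize the result back to DIGITS digits
    while m >= BASE ** DIGITS:
        m //= BASE
        e += 1
    while 0 < m < BASE ** (DIGITS - 1):
        m *= BASE
        e -= 1
    return m, e


def pipeline_clocks(n_results, n_stages=4):
    """Clocks for one pipelined adder: fill the stages, then 1 result per clock."""
    return n_stages + n_results - 1


print(fp_add((9500, 0), (7200, -1)))   # (1022, 1), i.e. 9500 + 720 = 10220
print(pipeline_clocks(3))              # 6
```

Once full, the pipelined adder delivers one result per clock. An array of three unpipelined adders, each taking four clocks, delivers three results every four clocks, as in panel (c).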

Some Duality of SIMD and MISD
[Figure: (a) scheme of SIMD; (b) scheme of MISD.]

Comparison of Vector and Parallel Computers
[Figure: (a) vector computer: CPU and memory, with vector registers feeding an arithmetic pipeline; (b) parallel computer: processors, each with a local memory, register file, cache, and ALU, connected through a switch.]

Scalar and Vector Processing in Applications

Paradigm for Application-Driven Parallel Processing
[Figure: parallel applications and algorithms, novel programming language support software, and parallel architecture(s).]

Relations among algorithm, computation model and architecture

More General Relations among algorithm, computation model and architecture

Design Flow of Microprocessors
[Figure: specification domain; conceptual design in the architecture domain, where design concept = computer architecture; software design & production domain (high-level language, compiler, operating system, assembler & assembly language, machine instructions); semiconductor-physical design of circuits with devices; chip implementation (CHIP).]

Hierarchy of Computer Architecture

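Structure of a Simple CPU
[Figure: main memory connected to the data register DR (c1 READ, c2 WRITE) and the address register AR; accumulator AC and ALU (c0 ADD); program counter PC; instruction register IR; the control unit issues control signals c0-c10.]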

Control Signals of the Simple CPU
c0: AC ← AC + DR
c1: DR ← M(AR) (READ M)
c2: M(AR) ← DR (WRITE M)
c3: RIGHT-SHIFT AC
c4: DR ← AC
c5: AC ← DR
c6: AR ← DR(ADR)
c7: PC ← DR(ADR)
c8: IR ← DR(OP)
c9: PC ← PC + 1
c10: AR ← PC

Operation of a Three-Instruction CPU
[Flowchart: Begin → CPU active? (No → End; Yes → fetch cycle).
Fetch cycle: AR ← PC; READ M; PC ← PC + 1, IR ← DR(OP); decode OP.
Execute cycle:
  LOAD: AR ← DR(ADR); READ M; AC ← DR
  ADD: AR ← DR(ADR); READ M; AC ← AC + DR
  JUMP: PC ← DR(ADR)]
AC = accumulator; AR = memory address register; DR = memory data register; DR(OP) = opcode field of DR; DR(ADR) = address field of DR; IR = instruction register; M = main memory; PC = program counter.
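The flowchart of the three-instruction CPU can be followed step by step in software. This is a minimal sketch with several hypothetical choices. The opcode values (LOAD = 0, ADD = 1, JUMP = 2) and the word encoding are assumed: the opcode sits above bit 7 and an 8-bit address field occupies the low bits. The "CPU active?" test is modeled as a fixed instruction budget.

```python
LOAD, ADD, JUMP = 0, 1, 2          # hypothetical opcode values


def op_field(word):
    return word >> 8               # DR(OP): assumed encoding, opcode above bit 7


def adr_field(word):
    return word & 0xFF             # DR(ADR): assumed 8-bit address field


def run(mem, max_instructions):
    AC = PC = 0
    for _ in range(max_instructions):   # "CPU active?" as an instruction budget
        # Fetch cycle
        AR = PC
        DR = mem[AR]                    # READ M
        PC = PC + 1
        IR = op_field(DR)
        # Decode OP, then execute cycle
        if IR == LOAD:
            AR = adr_field(DR)
            DR = mem[AR]                # READ M
            AC = DR
        elif IR == ADD:
            AR = adr_field(DR)
            DR = mem[AR]                # READ M
            AC = AC + DR
        elif IR == JUMP:
            PC = adr_field(DR)
        else:
            raise ValueError(f"undefined opcode {IR}")
    return AC, PC


mem = [0] * 16
mem[0] = (LOAD << 8) | 10   # LOAD 10
mem[1] = (ADD << 8) | 11    # ADD 11
mem[2] = (JUMP << 8) | 0    # JUMP 0
mem[10], mem[11] = 5, 7
print(run(mem, 3))          # (12, 0)
```

Here the control sequence is fixed in program logic, much as a hardwired control unit fixes it in gates. The microprogrammed version stores the same sequence in a control memory instead.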

[Figure: control memory addressed by a control memory address register that receives an external address source; each microinstruction supplies an address field and bits S, a2, a1, a0; an external condition input; a 1-to-8 decoder produces control signals c0-c8.]

Examples of Microprograms
Microprogram 1, FETCH: AR ← PC; READ M; PC ← PC + 1, IR ← DR(OP); go to IR;
Microprogram 2, LOAD: AR ← DR(ADR); READ M; AC ← DR; go to FETCH;
Microprogram 3, ADD: AR ← DR(ADR); READ M; AC ← AC + DR; go to FETCH;
Microprogram 4, JUMP: PC ← DR(ADR); go to FETCH;

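Microprogrammed Control Unit
[Figure: a multiplexer, steered by external conditions, chooses between an external address and the microprogram counter (μPC) to address the control memory (CM); the selected microinstruction is held in the microinstruction register (μIR) and decoded into control signals.]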

Microinstruction Format
[Figure: microinstruction fields: Condition Select | Branch Address | Control Fields for Control Signals.]
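The four microprograms (FETCH, LOAD, ADD, JUMP) can be stored in a control memory and sequenced with the format above: condition select, branch address, and control fields. This is a sketch under explicit assumptions. The pairing of signals c0-c10 with microoperations follows the control-signal table of the simple CPU as read from the extracted slide. The condition-select codes NEXT, MAP ("go to IR"), and BRANCH are hypothetical. The opcode map and the word encoding (opcode above bit 7, 8-bit address) are also assumed.

```python
# Register-transfer meaning of each control signal (assumed pairing, see lead-in)
MICRO_OPS = {
    "c0":  lambda r, m: r.update(AC=r["AC"] + r["DR"]),    # AC <- AC + DR
    "c1":  lambda r, m: r.update(DR=m[r["AR"]]),           # DR <- M(AR)  (READ M)
    "c2":  lambda r, m: m.__setitem__(r["AR"], r["DR"]),   # M(AR) <- DR  (WRITE M)
    "c3":  lambda r, m: r.update(AC=r["AC"] >> 1),         # RIGHT-SHIFT AC
    "c4":  lambda r, m: r.update(DR=r["AC"]),              # DR <- AC
    "c5":  lambda r, m: r.update(AC=r["DR"]),              # AC <- DR
    "c6":  lambda r, m: r.update(AR=r["DR"] & 0xFF),       # AR <- DR(ADR)
    "c7":  lambda r, m: r.update(PC=r["DR"] & 0xFF),       # PC <- DR(ADR)
    "c8":  lambda r, m: r.update(IR=r["DR"] >> 8),         # IR <- DR(OP)
    "c9":  lambda r, m: r.update(PC=r["PC"] + 1),          # PC <- PC + 1
    "c10": lambda r, m: r.update(AR=r["PC"]),              # AR <- PC
}

# Control memory: (condition select, branch address, control fields)
CM = [
    ("NEXT", None, ("c10",)),        # 0 FETCH: AR <- PC
    ("NEXT", None, ("c1",)),         # 1        READ M
    ("MAP", None, ("c8", "c9")),     # 2        PC <- PC + 1, IR <- DR(OP); go to IR
    ("NEXT", None, ("c6",)),         # 3 LOAD:  AR <- DR(ADR)
    ("NEXT", None, ("c1",)),         # 4        READ M
    ("BRANCH", 0, ("c5",)),          # 5        AC <- DR; go to FETCH
    ("NEXT", None, ("c6",)),         # 6 ADD:   AR <- DR(ADR)
    ("NEXT", None, ("c1",)),         # 7        READ M
    ("BRANCH", 0, ("c0",)),          # 8        AC <- AC + DR; go to FETCH
    ("BRANCH", 0, ("c7",)),          # 9 JUMP:  PC <- DR(ADR); go to FETCH
]
OPCODE_MAP = {0: 3, 1: 6, 2: 9}      # hypothetical opcodes LOAD=0, ADD=1, JUMP=2


def run(mem, n_instructions):
    r = {"AC": 0, "AR": 0, "DR": 0, "PC": 0, "IR": 0}
    upc = 0                          # microprogram counter
    completed = 0
    while completed < n_instructions:
        cond, branch, signals = CM[upc]
        for s in signals:            # assert the control signals of this step
            MICRO_OPS[s](r, mem)
        if cond == "NEXT":
            upc += 1
        elif cond == "MAP":          # "go to IR": dispatch on the opcode
            upc = OPCODE_MAP[r["IR"]]
        else:                        # unconditional branch back to FETCH
            upc = branch
            completed += 1
    return r


mem = [0] * 16
mem[0], mem[1], mem[2] = (0 << 8) | 10, (1 << 8) | 11, (2 << 8) | 0   # LOAD 10; ADD 11; JUMP 0
mem[10], mem[11] = 5, 7
state = run(mem, 3)
print(state["AC"], state["PC"])      # 12 0
```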

Microprogram and Nanoprogram
[Figure: the instruction register (IR) drives a microprogram control unit (control memory and microinstruction register), whose output drives a nanoprogram control unit (npc, control memory ncm, and nanoinstruction register nir) that produces the control signals.]
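A common motivation for a two-level micro/nano scheme is control-store size. If many microinstructions share the same wide control word, the micro level can hold short pointers while the nano level holds each distinct control word once. The sketch below compares the two storage costs. It ignores sequencing fields, and all sizes are hypothetical.

```python
import math


def single_level_bits(n_microinstructions, control_width):
    """Control store size when every microinstruction holds the full control word."""
    return n_microinstructions * control_width


def two_level_bits(n_microinstructions, n_distinct_words, control_width):
    """Micro store holds short pointers; nano store holds each distinct word once."""
    pointer_width = max(1, math.ceil(math.log2(n_distinct_words)))
    return n_microinstructions * pointer_width + n_distinct_words * control_width


# Hypothetical sizes: 1024 microinstructions, 100-bit control words, 128 distinct words
print(single_level_bits(1024, 100))      # 102400
print(two_level_bits(1024, 128, 100))    # 19968
```

The saving depends on how few distinct control words there are. The price is an extra memory access in the control path.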

CISC Instruction Pipeline
[Figure: stages IF (instruction fetch), ID (instruction decode), OF (operand fetch), EX (execution).]

Datapath Timing
[Figure: (a) conventional datapath: register file, Mux B, function unit, and Mux D traversed within one clock; (b) pipelined datapath: the same blocks split into clocked stages, beginning with an OF (operand fetch) stage.]

One Datapath Cycle
[Figure: timing of one datapath cycle, phases w, x, y, z, relative to clock cycles 1 and 2.]
w: The control signals are set up.
x: The register contents are placed onto the input buses.
y: The ALU operates.
z: The results are written back to the registers through the output bus.
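The gain from the pipelined datapath comes from its clock period. In the conventional datapath one clock must cover every block in series. Once the blocks are split into stages, the clock only has to cover the slowest stage plus the pipeline-register overhead. The sketch below compares the two. The three stage names (only OF appears on the slide) and all delay values are hypothetical.

```python
# Hypothetical stage delays in nanoseconds
STAGE_DELAYS_NS = {"OF": 3.0, "EX": 4.0, "WB": 2.0}  # reg file + Mux B, function unit, Mux D + write
PIPELINE_REGISTER_NS = 0.5                          # assumed per-stage register overhead


def conventional_period(delays):
    # every block is traversed within a single clock cycle
    return sum(delays.values())


def pipelined_period(delays, overhead):
    # the slowest stage plus the pipeline-register overhead sets the clock
    return max(delays.values()) + overhead


def total_time_ns(n_ops, delays, overhead):
    conventional = n_ops * conventional_period(delays)
    k = len(delays)
    pipelined = (k + n_ops - 1) * pipelined_period(delays, overhead)
    return conventional, pipelined


print(total_time_ns(100, STAGE_DELAYS_NS, PIPELINE_REGISTER_NS))  # (900.0, 459.0)
```

The speedup approaches the ratio of the two clock periods (here 9.0 / 4.5 = 2) as the number of operations grows. It falls short of the stage count because the stages are unbalanced and the registers add overhead.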

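Pipeline Execution for Microoperation Sequence
[Figure: microoperations 1-7 versus clock cycles 1-9; each microoperation enters the pipeline, starting with its OF stage, one clock after the previous one, so seven microoperations finish in nine clocks.]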
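Space-time diagrams like this one follow a simple rule: item i enters stage s in cycle i + s, so N items in a k-stage pipeline need k + N - 1 cycles. The sketch below generates such a diagram. The stage names after OF (EX, WB) are assumed labels, and `space_time` and `show` are hypothetical helper names.

```python
def space_time(n_items, stages):
    """Row i lists, for every clock cycle, the stage item i+1 occupies ('' if none)."""
    k = len(stages)
    total_cycles = k + n_items - 1
    rows = []
    for i in range(n_items):
        row = [""] * total_cycles
        for s, stage in enumerate(stages):
            row[i + s] = stage       # item i enters stage s in cycle i + s
        rows.append(row)
    return rows


def show(rows):
    for i, row in enumerate(rows, start=1):
        print(f"{i}: " + " ".join(f"{cell:>2}" for cell in row))


show(space_time(7, ["OF", "EX", "WB"]))   # 7 microoperations, 9 clock cycles
```

The same helper reproduces the six-instruction, nine-cycle IF/DE/EX/WB diagram with `space_time(6, ["IF", "DE", "EX", "WB"])`, or a five-stage RISC pipeline with `["IF", "ID", "EX", "MEM", "WB"]`.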

Control Unit and Datapath Domains in Pipelining
[Figure: instructions 1-6 versus clock cycles 1-9 through stages IF (instruction fetch), DE (decode and operand fetch), EX (execution), WB (write back), with the stages divided between a control unit domain and a datapath domain.]

RISC Instruction Pipeline
[Figure: stages IF (instruction fetch), ID (instruction decode), EX (execution), MEM (memory read/write), WB (write back).]

[Figure: pipelined RISC datapath and control, divided into stages that include a DOF (decode and operand fetch) stage. Labels: PC, instruction memory, IR, instruction decoder, register file, zero fill, MUX, function unit, data memory, data, address, control.]

Process of How a Pipeline Works
[Figure: successive snapshots of instructions 1-4 flowing through a pipeline; each snapshot shows the U, Reg (operands A, B, C), and ALU blocks.]

Model of a Superscalar Processor

General Model of Superscalar Processors
[Figure: program → instruction fetch & branch prediction → instruction dispatch → window of execution → instruction issue → instruction execution → instruction reorder & commit.]

Data Dependency
C = A + B        C = A + B
E = C + D        E = C + D
D = F + G        J = F + G
D = H + I        K = H + I
In the right-hand sequence, the two writes to D are renamed to J and K.
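The two sequences differ in which dependencies they contain, and a pairwise check makes this explicit. RAW (true dependency): a later statement reads what an earlier one writes. WAR (antidependency): a later statement overwrites what an earlier one reads. WAW (output dependency): both write the same name. The sketch below is a simple, conservative pairwise check that does not account for intervening writes. The function name is hypothetical.

```python
# Each statement: (destination, (source1, source2)); statements numbered from 1
ORIGINAL = [("C", ("A", "B")), ("E", ("C", "D")), ("D", ("F", "G")), ("D", ("H", "I"))]
RENAMED = [("C", ("A", "B")), ("E", ("C", "D")), ("J", ("F", "G")), ("K", ("H", "I"))]


def dependencies(code):
    """Return (earlier, later, kind) for every RAW, WAR, and WAW pair."""
    deps = set()
    for i, (dest_i, srcs_i) in enumerate(code):
        for j in range(i + 1, len(code)):
            dest_j, srcs_j = code[j]
            if dest_i in srcs_j:
                deps.add((i + 1, j + 1, "RAW"))   # later reads what earlier wrote
            if dest_j in srcs_i:
                deps.add((i + 1, j + 1, "WAR"))   # later overwrites what earlier read
            if dest_i == dest_j:
                deps.add((i + 1, j + 1, "WAW"))   # both write the same name
    return deps


print(sorted(dependencies(ORIGINAL)))
print(sorted(dependencies(RENAMED)))     # [(1, 2, 'RAW')]
```

Only the true dependency between the first two statements survives renaming. The WAR and WAW conflicts on D arise only from reusing a name, so renaming removes them and lets the last two statements execute out of order.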

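A Data Hazard Problem
MOV R1,R5       R1 <= R5
ADD R2,R1,R6    R2 <= R1 + R6
ADD R3,R1,R2    R3 <= R1 + R2
[Figure: the three instructions overlapped in the pipeline, each starting one cycle after the previous one.]
Each ADD reads a register written by the instruction just before it (R1 from MOV, R2 from the first ADD), so without intervention it may read the old value.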

Program-Based and Hardware Solutions
MOV R1,R5
[NOP]
ADD R2,R1,R6
[NOP]
ADD R3,R1,R2
[NOP]: NOPs in the case of the software solution; bubbles in the case of the hardware solution.
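The software solution can be automated by a simple pass that pads the code with NOPs until every consumer is far enough from its producer. The sketch below assumes a hypothetical `min_distance` of 2 slots, chosen so that it reproduces the single NOP between dependent instructions shown above. The real distance depends on where the pipeline writes and reads the register file. A hardware interlock would insert bubbles at the same points instead of NOPs.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]       # register written (None for NOP)
    srcs: Tuple[str, ...]     # registers read


NOP = Instr("NOP", None, ())


def insert_nops(program, min_distance=2):
    """Pad with NOPs so a reader of R is >= min_distance slots after R's writer."""
    out = []
    for ins in program:
        # look back over the last (min_distance - 1) emitted slots for a producer
        while any(
            out[-back].dest is not None and out[-back].dest in ins.srcs
            for back in range(1, min_distance)
            if back <= len(out)
        ):
            out.append(NOP)
        out.append(ins)
    return out


PROGRAM = [
    Instr("MOV R1,R5", "R1", ("R5",)),
    Instr("ADD R2,R1,R6", "R2", ("R1", "R6")),
    Instr("ADD R3,R1,R2", "R3", ("R1", "R2")),
]
print([i.text for i in insert_nops(PROGRAM)])
# ['MOV R1,R5', 'NOP', 'ADD R2,R1,R6', 'NOP', 'ADD R3,R1,R2']
```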

Control Hazard and Its Solution
1  BZ  R1,18
2  MOV R2,R3   [NOP]
3  MOV R1,R2   [NOP]
4  MOV R5,R6
[NOP]: NOPs in the case of the software solution; bubbles in the case of the hardware solution.
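Every slot after a branch that is filled by a NOP or a bubble costs one cycle of useful work. A simple model of the resulting average CPI is therefore: base CPI plus branch frequency times slots per branch. The sketch below assumes an ideal base CPI of 1 and ignores all other stalls. It uses two slots per branch, as in the BZ example, and a hypothetical 20% branch frequency.

```python
def effective_cpi(base_cpi, branch_fraction, slots_per_branch):
    """Average CPI when each branch is followed by slots that do no useful work."""
    return base_cpi + branch_fraction * slots_per_branch


# Two slots per branch as in the BZ example; 20% branches is a hypothetical mix.
print(round(effective_cpi(1.0, 0.2, 2), 3))   # 1.4
```

Filling the slots with useful instructions (delayed branching) or predicting the branch outcome reduces the effective number of wasted slots per branch.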

In-Order Issue - In-Order Completion
Cycle  Decode   Execute   Write
1      I1 I2
2      I3 I4    I1 I2
3      I3 I4    I1
4      I4       I3        I1 I2
5      I5 I6    I4
6      I6       I5        I3 I4
7               I6        I5
8                         I6

In-Order Issue - Out-of-Order Completion
Cycle  Decode   Execute   Write
1      I1 I2
2      I3 I4    I1 I2
3      I4       I1 I3     I2
4      I5 I6    I4        I1 I3
5      I6       I5        I4
6               I6        I5
7                         I6

In-Order Issue - Out-of-Order Completion Structure
[Figure: instructions (I1-I4 shown) leave the ID stage through an instruction issue network to functional units 1, 2, 3, ..., n.]

Out-of-Order Issue - Out-of-Order Completion
Cycle  Decode   Window      Execute   Write
1      I1 I2
2      I3 I4    I1, I2      I1 I2
3      I5 I6    I3, I4      I1 I3     I2
4               I4, I5, I6  I6 I4     I1 I3
5               I5          I5        I4 I6
6                                     I5
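The three issue/completion policies can be compared by recording, for each instruction I1 to I6, the cycle in which it first executes and the cycle in which it is written back. The cycles below are read from the three schedules on the preceding slides. The sketch checks whether issue and completion follow program order and reports the total cycle count. The dictionary and function names are hypothetical.

```python
# Per-instruction cycles for I1..I6, read from the three schedules
FIRST_EXECUTE = {
    "in-order issue / in-order completion":         [2, 2, 4, 5, 6, 7],
    "in-order issue / out-of-order completion":     [2, 2, 3, 4, 5, 6],
    "out-of-order issue / out-of-order completion": [2, 2, 3, 4, 5, 4],
}
WRITE = {
    "in-order issue / in-order completion":         [4, 4, 6, 6, 7, 8],
    "in-order issue / out-of-order completion":     [4, 3, 4, 5, 6, 7],
    "out-of-order issue / out-of-order completion": [4, 3, 4, 5, 6, 5],
}


def in_program_order(cycles):
    """True if no later instruction happens before an earlier one."""
    return all(a <= b for a, b in zip(cycles, cycles[1:]))


for policy in WRITE:
    print(f"{policy}: in-order issue={in_program_order(FIRST_EXECUTE[policy])}, "
          f"in-order completion={in_program_order(WRITE[policy])}, "
          f"total cycles={max(WRITE[policy])}")
```

Relaxing first the completion order and then the issue order shortens the same six-instruction sequence from 8 to 7 to 6 cycles.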