COSC 243 (Computer Architecture). Lecture 12: Computer Architecture 2



Overview. This lecture covers architectural topics: CISC, RISC, and multi-core processors. Source: Chapters 15, 16 and 17 (10th edition).

Moore's Law. Gordon E. Moore, co-founder of Intel, wrote in April 1965 that "the complexity for minimum component costs has increased at a rate of roughly a factor of two per year". In 1975 he altered his projection to a doubling every two years. The law is often misquoted as "doubles in speed every 18 months"!

Moore's Law. [Chart: transistor counts over time, from the 4004 and 6502 through the 8086, 80186, 80286, 80386, 80486, the Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4, the Core 2 Duo, and the 22-core Xeon and 61-core Xeon Phi.] https://en.wikipedia.org/wiki/moore%27s_law

CISC. What is the best thing to do with all those transistors? Add extra instructions? Make the CPU do more (integrated cache, etc.)? Add pipelines? CPUs designed this way are called Complex Instruction Set Computers.

High Level Languages. As the cost of a computer dropped, the relative cost of software went up. As computers became more ubiquitous, the need to port software from one machine to another increased. As the complexity of software went up, the need to use high level languages increased. Programs today are almost always written in high level languages. As time went on, languages became higher level: you could do more in the same number of lines of code.

The Semantic Gap. A semantic gap appeared: programming languages are disconnected from the CPU architecture (this is part of the purpose of high level languages). New instructions were added to the CPU, but they were not being used by programmers, who wrote in high level languages, and they were not being used by the compilers, because it wasn't worthwhile re-writing the compiler for each release of a CPU. The new instructions were being ignored. What we need is a CPU optimized for high level language use.

Research. In running C programs, call statements account for the largest share of machine code instructions executed and of memory references. Perhaps we should optimize those instructions?

Operation     Run-time Use    Instructions    Memory Use
assignment        38%             13%            15%
loop               3%             32%            26%
call              12%             33%            45%
if                43%             21%            12%
goto               3%              -              -
other              1%              1%             1%

More Research. In running C programs, most references are to local variables.

Type                    Use
Constants               23%
Variables               53%
Arrays / Structures     24%

Even More Research. 98% of procedures have fewer than 6 parameters, and 92% of procedures use fewer than 6 local variables. Let's optimize the CPU to make the slow parts faster. Let's use all those transistors to make faster and simpler CPUs.

RISC. Reduced Instruction Set Computers follow three design principles: a large number of registers, which reduces the number of memory accesses; careful design of the pipeline for conditional branches, giving better handling of if statements and procedure calls; and a simplified instruction set, where each instruction does less and there are fewer addressing modes. There are often just as many instructions as in a CISC CPU: reduced complexity does not mean a reduced number of instructions.

RISC. Characteristics: one instruction per cycle, which keeps the pipeline simple; register-to-register operations as the norm; simple addressing modes (often only one); and simple instruction formats, with a fixed instruction length aligned on machine word boundaries (for fast instruction loading by the CPU).

RISC Register Windows. A large number of registers, addressed relative to a start. The Out registers of one routine are the In registers of the next; these hold the parameters. Each routine also has Local registers. Upon entry to a routine the start is moved along to the next block. Some arbitrary depth of nesting is supported (often 8 or more windows); it is only necessary to write registers to memory after this depth of nested calls. There is some arbitrary number of registers in each window (8 or more); it is only necessary to write to memory if more than this number is being used. [Diagram: overlapping windows A.In | A.Local | A.Out = B.In | B.Local | B.Out = C.In | C.Local | C.Out = D.In | D.Local | D.Out.]
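To make the window overlap concrete, here is a minimal sketch in C (not from the lecture; the 8 registers per In/Local/Out group and the depth of 8 windows are assumed figures) of a circular register file in which one routine's Out registers become the next routine's In registers. Spilling to memory when the nesting depth is exceeded is omitted.

    /* Minimal register-window sketch: a circular register file where a call
       slides the window so the caller's Out group becomes the callee's In group. */
    #include <stdio.h>

    #define GROUP   8                      /* registers per In/Local/Out group */
    #define WINDOWS 8                      /* nesting depth before spilling    */
    #define FILE_SZ (WINDOWS * 2 * GROUP)  /* size of the circular file        */

    static int regfile[FILE_SZ];
    static int cwp = 0;                    /* current window pointer           */

    static int *in_regs(void)  { return &regfile[cwp]; }
    /* Local registers occupy regfile[cwp+GROUP .. cwp+2*GROUP-1]              */
    static int *out_regs(void) { return &regfile[(cwp + 2 * GROUP) % FILE_SZ]; }

    /* A call advances the window by two groups, so Out overlaps the new In;
       a return moves it back.  Spill/fill of overflowing windows is omitted.  */
    static void call_window(void) { cwp = (cwp + 2 * GROUP) % FILE_SZ; }
    static void ret_window(void)  { cwp = (cwp + FILE_SZ - 2 * GROUP) % FILE_SZ; }

    int main(void) {
        out_regs()[0] = 42;          /* caller puts a parameter in its Out group */
        call_window();
        printf("callee In[0] = %d\n", in_regs()[0]);  /* prints 42: same register */
        ret_window();
        return 0;
    }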

ARM 32-bit Registers. There are extra copies of SP and LR in each mode, and FIRQ has its own set of R8-R12. The modes are User, System (privileged user), Supervisor (OS), Abort, Undefined, IRQ and FIRQ.

    R0-R7       R0_usr-R7_usr, shared by all modes
    R8-R12      R8_usr-R12_usr, shared by all modes except FIRQ, which has R8_firq-R12_firq
    R13 (SP)    SP_usr, plus banked SP_svc, SP_abt, SP_und, SP_irq, SP_firq
    R14 (LR)    LR_usr, plus banked LR_svc, LR_abt, LR_und, LR_irq, LR_firq
    R15 (PC)    shared by all modes
    CPSR        shared, plus banked spsr_svc, spsr_abt, spsr_und, spsr_irq, spsr_firq

ARM Link Register. The link register, LR, is used in procedure calls.

6502:
    JSR        ; push the return address (minus 1) onto the stack
    RTS        ; pull the return address from the stack (and add 1)

ARM, either:
    BL         ; copy the PC to LR and branch
    BX LR      ; branch back to where LR points

or:
    BL         ; copy the PC to LR and branch
    PUSH {LR}  ; push LR onto the stack
    POP {PC}   ; pull the saved LR off the stack and store it in the PC

CISC vs. RISC. RISC requires more program instructions than CISC; RISC instructions are simplified, have fewer addressing modes, and take less memory space to store. CISC does more per instruction, but the control unit is more complex (and so slower), and the microcode is more complex (and so slower). The microcode is often effectively a RISC program!

CISC vs. RISC. CISC: minimise instructions per program, at the cost of more cycles per instruction. RISC: minimise cycles per instruction, at the cost of more instructions per program. In both cases the quantity being optimised is:

    time/program = (time/cycle) x (cycles/instruction) x (instructions/program)
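As a hypothetical worked example of this trade-off (the numbers here are illustrative, not from the lecture): a CISC machine running a program as 0.8 million instructions at 4 cycles per instruction with a 1 ns clock takes 0.8M x 4 x 1 ns = 3.2 ms, while a RISC machine needing 1.2 million simpler instructions at 1.2 cycles each with the same clock takes 1.2M x 1.2 x 1 ns = 1.44 ms. The extra instructions are more than paid for by the lower cycles per instruction.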

Superpipelines. If the RISC (or micro-coded CISC) instructions are so simple, then it isn't necessary to use an entire clock cycle to perform each stage of the pipeline. So we can double the CPU's internal clock speed and do each of the simple operations in half the time. This doubles CPU throughput by halving the time to complete each instruction.

Superscalar. Why use only one pipeline? Let's have two; then we can execute two instructions at once! This is called instruction-level parallelism. There are five limitations to instruction-level parallelism: true data dependency (read after write, RAW); output dependency (write after write, WAW); antidependency (write after read, WAR); procedural dependency (conditional branches require a pipeline reload); and resource conflicts (both pipelines require access to memory at the same time). The first three are illustrated in the sketch below.
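As a rough illustration (a C sketch, not from the lecture, with ordinary variables r1-r3 standing in for registers), the three data dependencies look like this:

    /* The three data dependencies that limit instruction-level parallelism,
       written as C statements with variables standing in for registers. */
    #include <stdio.h>

    int main(void) {
        int a = 1, b = 2, x = 3, y = 4;
        int r1, r2, r3;

        /* RAW (true data dependency): the second statement reads r1,
           so it cannot produce its result until the first has written r1. */
        r1 = a + b;
        r2 = r1 * 2;

        /* WAR (antidependency): the second statement writes r1, which the
           first still needs to read; the write must not overtake the read. */
        r3 = r1 + 1;
        r1 = x;

        /* WAW (output dependency): both statements write r1; the writes must
           retire in program order so the final value is y, not x. */
        r1 = x;
        r1 = y;

        printf("%d %d %d\n", r1, r2, r3);
        return 0;
    }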

Superscalar. However, if there are no dependencies, then the instructions need not be executed in program order so long as the result is the same. This is known as out-of-order execution. In-order issue with in-order completion: instructions must start and finish in the original order. In-order issue with out-of-order completion: the CPU starts the instructions in order, but the second one may finish before the first! Out-of-order issue with out-of-order completion: the CPU may start the next instruction before the current one! E.g. for the 6502 pair TSX then TYA, does the order matter?

Superscalar. A program is a linear sequence of instructions. Instruction fetch with branch prediction produces an instruction stream. The stream is examined for dependencies, and instructions are re-ordered according to those dependencies. Instructions are executed based on their dependencies on each other and on the available hardware resources. Results are then committed or discarded (discarded when a speculative prediction turns out to be wrong).

Superscalar. [Diagram: static program -> instruction fetch and branch prediction (produces an instruction stream) -> instruction dispatch into the window of execution -> instruction issue -> instruction execution -> instruction re-order and commit.]

Hyperthreading (SMT). We can do more! The CPU slows down when we access memory, and the pipeline slows down when we have dependencies. Can we write programs that do more than one thing at a time but whose parts don't interact (much)? Yes: we can use threading. (The OS already switches between programs too.) Perhaps we can build that switching into the CPU as well.
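Here is a minimal sketch (not from the lecture) of two threads whose parts don't interact, written in C with POSIX threads: each thread updates its own counter, so an SMT core (or a second core) can run them at the same time.

    /* Two independent threads: no shared data is written by both, so they can
       run simultaneously.  Compile with: cc smt_demo.c -pthread */
    #include <pthread.h>
    #include <stdio.h>

    static long sum_a, sum_b;   /* separate results; the threads never interact */

    static void *worker_a(void *arg) {
        (void)arg;
        for (long i = 0; i < 100000000L; i++) sum_a += i;
        return NULL;
    }

    static void *worker_b(void *arg) {
        (void)arg;
        for (long i = 0; i < 100000000L; i++) sum_b += 2 * i;
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, worker_a, NULL);
        pthread_create(&b, NULL, worker_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("sum_a = %ld, sum_b = %ld\n", sum_a, sum_b);
        return 0;
    }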

Hyperthreading (SMT). Imagine a superscalar architecture with two pipelines, where each pipeline reads from a different part of memory and each pipeline has a separate set of registers. If one pipeline becomes stalled, the other keeps going, and two programs are executed at the same time! This is called Simultaneous Multithreading (SMT), and it is the approach of Intel's Hyper-Threading CPUs, such as the Pentium 4 and later.

Heat! The heat dissipation in a transistor is linear in the switching rate: the faster you switch, the more heat you get. The total amount of heat generated is also linear in the number of transistors on the silicon die. Both the switching rate and the transistor count have been following Moore's Law, so power grows as the product of the two. An Intel i7 Extreme dissipates 130 W; a bright light bulb is 100 W.

Multi-Core. How can we reduce the heat? The obvious solution is to go slower. But how can we go slower and faster at the same time? Instead of having one CPU on the silicon die, we put two. The cores share certain resources, including the buses and the level 2 cache.

Vectors. What if you want to do the same operation to each element of an array? One way is to use wide (array) registers and instructions that perform an operation on every element; instruction decoding then only occurs once. E.g. PADDD xmm0, xmm1:

             [3]   [2]   [1]   [0]
    xmm0      34    56    66    23
    xmm1       2     3     4     5
    xmm0      36    59    70    28    (after PADDD)

We call this Single Instruction Multiple Data (SIMD). Intel: SSE (128-bit), AVX (256-bit), AVX-512 (512-bit), operating on arrays of 32/64-bit floats or 8/16/32/64-bit integers.
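The same addition can be written with compiler intrinsics. Here is a minimal C sketch (the values mirror the PADDD example above) using SSE2's _mm_add_epi32, which the compiler implements with PADDD; it assumes an x86 machine with SSE2.

    /* Add four 32-bit integers at once with one SSE2 instruction. */
    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        /* _mm_set_epi32 lists elements from [3] down to [0] */
        __m128i a = _mm_set_epi32(34, 56, 66, 23);
        __m128i b = _mm_set_epi32( 2,  3,  4,  5);
        __m128i r = _mm_add_epi32(a, b);        /* one PADDD does all four adds */

        int out[4];
        _mm_storeu_si128((__m128i *)out, r);
        printf("%d %d %d %d\n", out[3], out[2], out[1], out[0]);  /* 36 59 70 28 */
        return 0;
    }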

Classification of Architectures. Single instruction, single data (SISD): a normal computer. Single instruction, multiple data (SIMD): Intel SSE instructions etc., and graphics processors. Multiple instruction, multiple data (MIMD): multi-core.

Coprocessors. Graphics, sound, maths, physics, and network co-processors. These often work by catching illegal instructions and interpreting them on behalf of the CPU.