COSC 243 (Computer Architecture). Lecture 12: Computer Architecture 2



Overview. This lecture covers architectural topics: CISC, RISC, and multi-core processors. Source: Chapters 15, 16 and 17 (10th edition).

Moore's Law. Gordon E. Moore, co-founder of Intel, wrote in April 1965 that "the complexity for minimum component costs has increased at a rate of roughly a factor of two per year". In 1975 he altered his projection to a doubling every two years. The law is often misquoted as "doubles in speed every 18 months"!

Moore's Law. [Chart: transistor counts over time, from the 4004 and 6502 through the 8086, 80186, 80286, 80386, 80486, the Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4, the Core 2 Duo, and the 22-core Xeon and 61-core Xeon Phi.] https://en.wikipedia.org/wiki/moore%27s_law

CISC. What is the best thing to do with all those transistors? Add extra instructions? Make the CPU do more (integrated cache, etc.)? Add pipelines? CPUs designed this way are called Complex Instruction Set Computers.

High Level Languages. As the cost of a computer dropped, the relative cost of software went up. As computers became more ubiquitous, the need to port software from one machine to another increased. As the complexity of software went up, the need to use high level languages increased. Programs today are almost always written in high level languages. As time went on, languages became higher level: you could do more in the same number of lines of code.

The Semantic Gap. A semantic gap appeared: programming languages are disconnected from the CPU architecture (this is part of the purpose of high level languages). New instructions were added to the CPU, but they were not being used by programmers, who wrote in high level languages, and they were not being used by the compilers, because it wasn't worthwhile re-writing the compiler for each release of a CPU. The new instructions were being ignored. What we need is a CPU optimized for high level language use.

Research. In running C programs, call statements account for the largest share of machine code instructions executed and of memory references. Perhaps we should optimize those instructions?

Operation     Run-time Use    Instructions    Memory Use
assignment        38%             13%            15%
loop               3%             32%            26%
call              12%             33%            45%
if                43%             21%            12%
goto               3%              -              -
other              1%              1%             1%

More Research. In running C programs, most references are to local variables.

Type                    Use
Constants               23%
Variables               53%
Arrays / Structures     24%

Even More Research. 98% of procedures have fewer than 6 parameters, and 92% of procedures use fewer than 6 local variables. Let's optimize the CPU to make the slow parts faster. Let's use all those transistors to make faster and simpler CPUs.

RISC. Reduced Instruction Set Computers follow three design principles: a large number of registers, which reduces the number of memory accesses; careful design of the pipeline for conditional branches, giving better handling of if statements and procedure calls; and a simplified instruction set, where each instruction does less and there are fewer addressing modes. There are often just as many instructions as in a CISC CPU: reduced complexity does not mean a reduced number of instructions.

RISC. Characteristics: one instruction per cycle, which keeps the pipeline simple; register-to-register operations as the norm; simple addressing modes (often only one); and simple instruction formats, with a fixed instruction length aligned on machine word boundaries (for fast instruction loading by the CPU).

RISC Register Windows. A large number of registers, addressed relative to a start. The Out registers of one routine are the In registers of the next; these hold the parameters. Each routine also has Local registers. Upon entry to a routine the start is moved along to the next block. Some arbitrary depth of nesting is supported (often 8 or more windows); it is only necessary to write registers to memory after this depth of nested calls. There is some arbitrary number of registers in each window (8 or more); it is only necessary to write to memory if more than this number is being used. [Diagram: overlapping windows A.In | A.Local | A.Out = B.In | B.Local | B.Out = C.In | C.Local | C.Out = D.In | D.Local | D.Out.]
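To make the window overlap concrete, here is a minimal sketch in C (not from the lecture; the 8 registers per In/Local/Out group and the depth of 8 windows are assumed figures) of a circular register file in which one routine's Out registers become the next routine's In registers. Spilling to memory when the nesting depth is exceeded is omitted.

    /* Minimal register-window sketch: a circular register file where a call
       slides the window so the caller's Out group becomes the callee's In group. */
    #include <stdio.h>

    #define GROUP   8                      /* registers per In/Local/Out group */
    #define WINDOWS 8                      /* nesting depth before spilling    */
    #define FILE_SZ (WINDOWS * 2 * GROUP)  /* size of the circular file        */

    static int regfile[FILE_SZ];
    static int cwp = 0;                    /* current window pointer           */

    static int *in_regs(void)  { return &regfile[cwp]; }
    /* Local registers occupy regfile[cwp+GROUP .. cwp+2*GROUP-1]              */
    static int *out_regs(void) { return &regfile[(cwp + 2 * GROUP) % FILE_SZ]; }

    /* A call advances the window by two groups, so Out overlaps the new In;
       a return moves it back.  Spill/fill of overflowing windows is omitted.  */
    static void call_window(void) { cwp = (cwp + 2 * GROUP) % FILE_SZ; }
    static void ret_window(void)  { cwp = (cwp + FILE_SZ - 2 * GROUP) % FILE_SZ; }

    int main(void) {
        out_regs()[0] = 42;          /* caller puts a parameter in its Out group */
        call_window();
        printf("callee In[0] = %d\n", in_regs()[0]);  /* prints 42: same register */
        ret_window();
        return 0;
    }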

ARM 32-bit Registers. There are extra copies of SP and LR in each mode, and FIRQ has its own set of R8-R12. The modes are User, System (privileged user), Supervisor (OS), Abort, Undefined, IRQ and FIRQ.

    R0-R7       R0_usr-R7_usr, shared by all modes
    R8-R12      R8_usr-R12_usr, shared by all modes except FIRQ, which has R8_firq-R12_firq
    R13 (SP)    SP_usr, plus banked SP_svc, SP_abt, SP_und, SP_irq, SP_firq
    R14 (LR)    LR_usr, plus banked LR_svc, LR_abt, LR_und, LR_irq, LR_firq
    R15 (PC)    shared by all modes
    CPSR        shared, plus banked spsr_svc, spsr_abt, spsr_und, spsr_irq, spsr_firq

ARM Link Register. The link register, LR, is used in procedure calls.

6502:
    JSR        ; push the return address (minus 1) onto the stack
    RTS        ; pull the return address from the stack (and add 1)

ARM, either:
    BL         ; copy the PC to LR and branch
    BX LR      ; branch back to where LR points

or:
    BL         ; copy the PC to LR and branch
    PUSH {LR}  ; push LR onto the stack
    POP {PC}   ; pull the saved LR off the stack and store it in the PC

CISC vs. RISC. RISC requires more program instructions than CISC; RISC instructions are simplified, have fewer addressing modes, and take less memory space to store. CISC does more per instruction, but the control unit is more complex (and so slower), and the microcode is more complex (and so slower). The microcode is often effectively a RISC program!

CISC vs. RISC. CISC: minimise instructions per program, at the cost of more cycles per instruction. RISC: minimise cycles per instruction, at the cost of more instructions per program. In both cases the quantity being optimised is:

    time/program = (time/cycle) x (cycles/instruction) x (instructions/program)
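As a hypothetical worked example of this trade-off (the numbers here are illustrative, not from the lecture): a CISC machine running a program as 0.8 million instructions at 4 cycles per instruction with a 1 ns clock takes 0.8M x 4 x 1 ns = 3.2 ms, while a RISC machine needing 1.2 million simpler instructions at 1.2 cycles each with the same clock takes 1.2M x 1.2 x 1 ns = 1.44 ms. The extra instructions are more than paid for by the lower cycles per instruction.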

Superpipelines. If the RISC (or micro-coded CISC) instructions are so simple, then it isn't necessary to use an entire clock cycle to perform each stage of the pipeline. So we can double the CPU's internal clock speed and do each of the simple operations in half the time. This doubles CPU throughput by halving the time to complete each instruction.

Superscalar. Why use only one pipeline? Let's have two; then we can execute two instructions at once! This is called instruction-level parallelism. There are five limitations to instruction-level parallelism: true data dependency (read after write, RAW); output dependency (write after write, WAW); antidependency (write after read, WAR); procedural dependency (conditional branches require a pipeline reload); and resource conflicts (both pipelines require access to memory at the same time). The first three are illustrated in the sketch below.
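As a rough illustration (a C sketch, not from the lecture, with ordinary variables r1-r3 standing in for registers), the three data dependencies look like this:

    /* The three data dependencies that limit instruction-level parallelism,
       written as C statements with variables standing in for registers. */
    #include <stdio.h>

    int main(void) {
        int a = 1, b = 2, x = 3, y = 4;
        int r1, r2, r3;

        /* RAW (true data dependency): the second statement reads r1,
           so it cannot produce its result until the first has written r1. */
        r1 = a + b;
        r2 = r1 * 2;

        /* WAR (antidependency): the second statement writes r1, which the
           first still needs to read; the write must not overtake the read. */
        r3 = r1 + 1;
        r1 = x;

        /* WAW (output dependency): both statements write r1; the writes must
           retire in program order so the final value is y, not x. */
        r1 = x;
        r1 = y;

        printf("%d %d %d\n", r1, r2, r3);
        return 0;
    }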

Superscalar. However, if there are no dependencies, then the instructions need not be executed in program order so long as the result is the same. This is known as out-of-order execution. In-order issue with in-order completion: instructions must start and finish in the original order. In-order issue with out-of-order completion: the CPU starts the instructions in order, but the second one may finish before the first! Out-of-order issue with out-of-order completion: the CPU may start the next instruction before the current one! E.g. for the 6502 pair TSX then TYA, does the order matter?

Superscalar. A program is a linear sequence of instructions. Instruction fetch with branch prediction produces an instruction stream. The stream is examined for dependencies, and instructions are re-ordered according to those dependencies. Instructions are executed based on their dependencies on each other and on the available hardware resources. Results are then committed or discarded (discarded when a speculative prediction turns out to be wrong).

Superscalar. [Diagram: static program -> instruction fetch and branch prediction (produces an instruction stream) -> instruction dispatch into the window of execution -> instruction issue -> instruction execution -> instruction re-order and commit.]

Hyperthreading (SMT). We can do more! The CPU slows down when we access memory, and the pipeline slows down when we have dependencies. Can we write programs that do more than one thing at a time but whose parts don't interact (much)? Yes: we can use threading. (The OS already switches between programs too.) Perhaps we can build that switching into the CPU as well.
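Here is a minimal sketch (not from the lecture) of two threads whose parts don't interact, written in C with POSIX threads: each thread updates its own counter, so an SMT core (or a second core) can run them at the same time.

    /* Two independent threads: no shared data is written by both, so they can
       run simultaneously.  Compile with: cc smt_demo.c -pthread */
    #include <pthread.h>
    #include <stdio.h>

    static long sum_a, sum_b;   /* separate results; the threads never interact */

    static void *worker_a(void *arg) {
        (void)arg;
        for (long i = 0; i < 100000000L; i++) sum_a += i;
        return NULL;
    }

    static void *worker_b(void *arg) {
        (void)arg;
        for (long i = 0; i < 100000000L; i++) sum_b += 2 * i;
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, worker_a, NULL);
        pthread_create(&b, NULL, worker_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("sum_a = %ld, sum_b = %ld\n", sum_a, sum_b);
        return 0;
    }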

Hyperthreading (SMT). Imagine a superscalar architecture with two pipelines, where each pipeline reads from a different part of memory and each pipeline has a separate set of registers. If one pipeline becomes stalled, the other keeps going, and two programs are executed at the same time! This is called Simultaneous Multithreading (SMT), and it is the approach of Intel's Hyper-Threading CPUs, such as the Pentium 4 and later.

Heat! The heat dissipation in a transistor is linear in the switching rate: the faster you switch, the more heat you get. The total amount of heat generated is also linear in the number of transistors on the silicon die. Both the switching rate and the transistor count have been following Moore's Law, so power grows as the product of the two. An Intel i7 Extreme dissipates 130 W; a bright light bulb is 100 W.

Multi-Core. How can we reduce the heat? The obvious solution is to go slower. But how can we go slower and faster at the same time? Instead of having one CPU on the silicon die, we put two. The cores share certain resources, including the buses and the level 2 cache.

Vectors. What if you want to do the same operation to each element of an array? One way is to use wide (array) registers and instructions that perform an operation on every element; instruction decoding then only occurs once. E.g. PADDD xmm0, xmm1:

             [3]   [2]   [1]   [0]
    xmm0      34    56    66    23
    xmm1       2     3     4     5
    xmm0      36    59    70    28    (after PADDD)

We call this Single Instruction Multiple Data (SIMD). Intel: SSE (128-bit), AVX (256-bit), AVX-512 (512-bit), operating on arrays of 32/64-bit floats or 8/16/32/64-bit integers.
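The same addition can be written with compiler intrinsics. Here is a minimal C sketch (the values mirror the PADDD example above) using SSE2's _mm_add_epi32, which the compiler implements with PADDD; it assumes an x86 machine with SSE2.

    /* Add four 32-bit integers at once with one SSE2 instruction. */
    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        /* _mm_set_epi32 lists elements from [3] down to [0] */
        __m128i a = _mm_set_epi32(34, 56, 66, 23);
        __m128i b = _mm_set_epi32( 2,  3,  4,  5);
        __m128i r = _mm_add_epi32(a, b);        /* one PADDD does all four adds */

        int out[4];
        _mm_storeu_si128((__m128i *)out, r);
        printf("%d %d %d %d\n", out[3], out[2], out[1], out[0]);  /* 36 59 70 28 */
        return 0;
    }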

Classification of Architectures. Single instruction, single data (SISD): a normal computer. Single instruction, multiple data (SIMD): Intel SSE instructions etc., and graphics processors. Multiple instruction, multiple data (MIMD): multi-core.

Coprocessors. Graphics, sound, maths, physics, and network co-processors. These often work by catching illegal instructions and interpreting them on behalf of the CPU.