EE 4980 Modern Electronic Systems: Processor Advanced

Architecture: General Purpose Processor vs. Embedded Processor
General Purpose Processor
- User programmable: intended to run end-user-selected programs
- Application independent: PowerPoint, Chrome, Twitter, Angry Birds, ...
Embedded Processor
- Not user programmable: programmed by the manufacturer
- Application driven: non-smart phones, appliances, missiles, automobiles, ...
- Very wide and very deep application profile

Architecture: General Purpose Processor - Key Characteristics
- 32/64-bit operations
- Support non-real-time / time-sharing operating systems
- Support complex memory systems: multi-level cache, DRAM, virtual memory
- Support DMA-driven I/O
- Complex CPU structures: pipelining, superscalar execution, out-of-order execution (OoO), floating-point hardware

Architecture: General Purpose Processor - Examples
- ARM7, ARM9, Cortex-A8, A9, A15
- Intel Pentium and Core i-series; AMD Phenom, Athlon, Opteron
- Apple A4, A5
- TI OMAP

Architecture: Embedded Processor - Key Characteristics
- 4/8/16/32-bit operations
- Support real-time operating systems
- Relatively simple memory systems; memory-mapped I/O
- Simple CPU structures: few registers, limited instruction set
- Support for multiple I/O schemes
- Wide range of peripheral support: A/D, D/A, sensors
- Extensive interrupt support

Architecture: Embedded Processor - Examples
- Motorola/Freescale 68K, HC11, HCS12
- ARM Cortex-R and Cortex-M series
- Atmel AVR

Architecture: CISC vs. RISC
CISC (Complex Instruction Set Computer)
- The name didn't even exist until RISC was defined
- Used in most processors until about 1980
- One instruction holds multiple actions: load data from a location, add, write the result to a new location
- Instructions were often designed to emulate high-level language constructs
RISC (Reduced Instruction Set Computer)
- Developed in the 1980s; the most prevalent architecture today
- Sometimes called a load/store architecture
- Instructions are simple: load data from a location, add, store data to a location
- RISC dominates today because it is much easier to take advantage of advanced structures such as pipelining, superscalar execution, and OoO execution

Introduction: Processor Performance
- Performance improvement of 24,000x with a frequency improvement of only 660x. How?
[figure]
Source: Computer Architecture, Hennessy and Patterson, 2012, Elsevier Inc.

Introduction: Processor Performance - Where the improvement comes from
- Faster transistors
- Larger die
- Pipelining
- Superscalar execution
- OoO execution
- SISD -> MIMD
- Memory hierarchy
- Moore's Law: 200,000x
Together these explain a performance improvement of 24,000x on a frequency improvement of only 660x.
Source: Computer Architecture, Hennessy and Patterson, 2012, Elsevier Inc.

Architecture: Memory Bus Structure - von Neumann vs. Harvard
- von Neumann: a single unified memory holds both instructions and data, connected to the CPU (ALU, control, status) by one set of address, data, and control buses
- Harvard: separate instruction memory and data memory, each with its own address, data, and control buses to the CPU
[diagram: the two bus structures]

Architecture: Memory Bus Structure - Modified Harvard
- A unified main memory sits behind separate instruction and data memories/paths to the CPU
[diagram: modified Harvard bus structure]

Architecture: Cache Memory (Modified Harvard)
- The instruction and data memories are often augmented by cache memories, or are caches themselves
[diagram: modified Harvard structure with the unified memory backing the instruction and data memories/caches]

Architecture: Instruction / Data Structures
- SISD: Single Instruction, Single Data
- SIMD: Single Instruction, Multiple Data
- MISD: Multiple Instruction, Single Data
- MIMD: Multiple Instruction, Multiple Data
[diagram: instruction and data streams feeding one or more processors P for each class]
A small code contrast of SISD vs. SIMD follows below.
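To make the SISD vs. SIMD distinction concrete, the sketch below adds two small arrays first as an ordinary scalar loop (one instruction operating on one data element at a time) and then with GCC/Clang vector extensions, where a single vector add operates on four elements at once. The vector-extension syntax is a toolchain assumption used only to illustrate the classification, not any specific processor's SIMD unit.

    #include <stdio.h>

    /* Four packed 32-bit ints; GCC/Clang lower arithmetic on this type to
       SIMD instructions when the target supports them.                    */
    typedef int v4si __attribute__((vector_size(16)));

    int main(void)
    {
        int a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];

        /* SISD view: one add instruction per data element. */
        for (int i = 0; i < 4; i++)
            c[i] = a[i] + b[i];

        /* SIMD view: one vector add covers all four elements. */
        v4si va = {1, 2, 3, 4}, vb = {10, 20, 30, 40};
        v4si vc = va + vb;

        for (int i = 0; i < 4; i++)
            printf("scalar %d  vector %d\n", c[i], vc[i]);
        return 0;
    }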

Architecture: Cache Memory
- Cache memory stores relatively small amounts of data or program for a relatively short amount of time
- Caches sit between the processor and the main memory
- Fast: keeping them small keeps them fast, allowing the processor to run faster than main memory alone would allow
- Leverage temporal locality: if you have recently used a piece of data, you are likely to use it again
- Leverage spatial locality: program code and data structures are generally contiguous in memory

Architecture: Cache Memory - Basic Operation
- The processor requests a byte of program or data
- The system first checks whether the byte is already in the cache
  - if yes: read the byte and continue (a cache hit)
  - if no: stall, or allow the processor to do something else (a cache miss); read the byte from main memory into the cache, then read it from the cache and continue
- If the cache is full and a new byte needs to be loaded, several replacement policies can be used to remove an existing entry:
  - LRU: the least recently used entry is removed
  - FIFO: the oldest loaded entry is removed
A sketch of this hit/miss flow follows below.
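As a rough illustration of the hit/miss flow above, here is a minimal direct-mapped cache sketch in C. The line count, block size, and names (cache_read, line_t) are illustrative assumptions rather than the organization of any particular processor's cache; with a set-associative cache, the LRU/FIFO policies listed above choose which way of the selected set to evict.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define NUM_LINES  64          /* assumed: 64 cache lines      */
    #define BLOCK_SIZE 16          /* assumed: 16 bytes per line   */

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    } line_t;

    static line_t cache[NUM_LINES];

    /* Read one byte, going to main memory only on a miss. */
    static uint8_t cache_read(uint32_t addr, const uint8_t *main_mem)
    {
        uint32_t offset = addr % BLOCK_SIZE;
        uint32_t index  = (addr / BLOCK_SIZE) % NUM_LINES;
        uint32_t tag    = addr / (BLOCK_SIZE * NUM_LINES);
        line_t  *line   = &cache[index];

        if (line->valid && line->tag == tag)
            return line->data[offset];                    /* cache hit */

        /* Cache miss: refill the whole block (spatial locality), then retry. */
        memcpy(line->data, &main_mem[addr - offset], BLOCK_SIZE);
        line->valid = true;
        line->tag   = tag;
        return line->data[offset];
    }

    int main(void)
    {
        static uint8_t main_mem[4096];
        main_mem[100] = 42;

        uint8_t first  = cache_read(100, main_mem);  /* miss: block loaded    */
        uint8_t second = cache_read(101, main_mem);  /* hit: spatial locality */
        return (first == 42 && second == 0) ? 0 : 1;
    }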

Architecture: Pipelining - No Pipeline
- Execute = fetch instruction, decode, execute, write back
- Without a pipeline, instructions A, B, C, D each occupy the CPU for a full 4 us before the next instruction can start; one instruction retires every 4 us
[diagram: clock cycles 0-5 showing waiting, executing, and retired instructions for the non-pipelined case]

Architecture: Pipelining
- Break complex tasks into smaller chunks
- Start the next instruction as soon as each subtask is complete
- With the execute step split into fetch, decode, execute, and write back stages of 1 us each, instructions A-D overlap in the pipeline and, once it is full, an instruction retires every 1 us
[diagram: clock cycles 0-8 showing A-D flowing through the fetch / decode / execute / write back stages]

Pipelining: Simple Datapath
[figure: single-cycle datapath]

Pipelining: 5 Stages of Instruction Execution
- Fetch (IF)
- Decode / Register Access (ID)
- Execute (EX)
- Memory Access (MEM)
- Write Back (WB)
Pipeline these at one stage per clock cycle.

Pipelining: Pipeline Performance
- Pipelining does not reduce the time to execute a single instruction; in fact, it usually increases the instruction execution time
- Pipelining does increase the instruction throughput
[timing diagram: non-pipelined, one IF/ID/EX/MEM/WB sequence per 1000 time units, vs. pipelined, with 200-unit stages and a new instruction completing every 200 units once the pipeline is full]

Pipelining: Pipeline Performance
- Non-pipelined: 1M instructions take 1x10^9 units of time
- Pipelined (5 stages): 1M instructions take about 2x10^8 units of time (plus the few cycles needed to fill the 5-stage pipeline)
- Overall throughput improvement of 5x

Pipelining: Pipeline Performance
- Non-pipelined: 1M instructions take 1x10^9 units of time
- Pipelined (5 stages with a 20% penalty per stage): 1M instructions take about 2.2x10^8 units of time
- Overall throughput improvement of 4.5x
The sketch below reproduces both calculations.
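To make the arithmetic on the last two slides concrete, this C sketch computes both cases; the 20% figure is modeled, as assumed above, as a stretch of every stage (and therefore of the clock period).

    #include <stdio.h>

    int main(void)
    {
        const double instructions  = 1e6;
        const double inst_time     = 1000.0;  /* non-pipelined time per instruction */
        const int    stages        = 5;
        const double stage_penalty = 0.20;    /* 20% overhead added to each stage   */

        double non_pipelined = instructions * inst_time;

        /* Ideal pipeline: stage time = inst_time / stages; one instruction
           completes per stage time once the pipeline is full.              */
        double ideal_stage = inst_time / stages;
        double ideal_total = (instructions + stages - 1) * ideal_stage;

        /* With the per-stage penalty, every clock stretches by 20%. */
        double real_stage  = ideal_stage * (1.0 + stage_penalty);
        double real_total  = (instructions + stages - 1) * real_stage;

        printf("non-pipelined:     %.2e units\n", non_pipelined);
        printf("ideal pipeline:    %.2e units, speedup %.1fx\n",
               ideal_total, non_pipelined / ideal_total);
        printf("with 20%% penalty: %.2e units, speedup %.1fx\n",
               real_total, non_pipelined / real_total);
        return 0;
    }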

Pipelining: Pipeline Performance
- Pipeline stages typically do not all take the same amount of time:

  Stage:  IF      ID/RR   EX      MEM     WB
  Delay:  200 ps  100 ps  200 ps  200 ps  100 ps

- Non-pipelined instruction throughput = 1 inst / 800 ps
- Pipelined (5-stage) instruction throughput = 1 inst / 200 ps (the clock period is set by the slowest stage)
- Overall throughput improvement of 4x
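The same style of calculation for the unequal stage delays above: the pipelined clock period is set by the slowest stage, which is why the improvement is 4x rather than 5x. The stage delays are the ones listed on the slide.

    #include <stdio.h>

    int main(void)
    {
        /* Stage delays from the slide: IF, ID/RR, EX, MEM, WB */
        const double delay_ps[] = { 200.0, 100.0, 200.0, 200.0, 100.0 };
        const int    n = sizeof delay_ps / sizeof delay_ps[0];

        double sum = 0.0, max = 0.0;
        for (int i = 0; i < n; i++) {
            sum += delay_ps[i];
            if (delay_ps[i] > max)
                max = delay_ps[i];
        }

        /* Non-pipelined: one instruction per sum of all stage delays.
           Pipelined: one instruction per clock, with the clock period
           set by the slowest stage.                                    */
        printf("non-pipelined throughput: 1 inst / %.0f ps\n", sum);
        printf("pipelined throughput:     1 inst / %.0f ps\n", max);
        printf("throughput improvement:   %.1fx\n", sum / max);
        return 0;
    }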

Pipelining: Data Hazards
- These hazards result from a dependence of one instruction on another instruction still in the pipeline
- Consider the following code snippet:

  add $s0, $t0, $t1
  sub $t2, $s0, $t3

- The value of $s0 is needed to perform the subtraction

Pipelining: Data Hazards

  add $s0, $t0, $t1
  sub $t2, $s0, $t3

[pipeline diagram: the sub is held in ID for two cycles, sending two bubbles down EX/MEM/WB, until the add has written $s0 back]
- Two clock-cycle bubbles are created
- It would be three bubbles except that we can take advantage of our convention:
  - writes occur in the first half of the clock cycle
  - reads occur in the second half of the clock cycle
  - so the WB can occur during the same clock cycle as the register read

Pipelining: Data Hazards (continued)
[graphical pipeline diagram of the same add/sub sequence: two bubbles, with the WB and the register read sharing a cycle thanks to the half-cycle write/read convention]
A sketch of the hazard-detection check follows below.
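As a rough sketch of how a pipeline decides it must insert bubbles, the C fragment below checks whether the instruction entering ID reads a register that an older instruction still in the pipeline has not yet written (a read-after-write hazard). The struct layout, field names, and register-number comments are illustrative assumptions, not the control logic of a specific processor.

    #include <stdbool.h>
    #include <stdint.h>

    /* Minimal view of an instruction for hazard checking (illustrative). */
    typedef struct {
        uint8_t rd;          /* destination register number     */
        uint8_t rs, rt;      /* source register numbers         */
        bool    writes_reg;  /* does this instruction write rd? */
    } instr_t;

    /* Return true if 'younger' (in ID) must stall because 'older'
       (still in EX or MEM) has not yet written a register it reads. */
    static bool raw_hazard(const instr_t *older, const instr_t *younger)
    {
        if (!older->writes_reg || older->rd == 0)  /* $zero never causes a hazard */
            return false;
        return older->rd == younger->rs || older->rd == younger->rt;
    }

    int main(void)
    {
        /* add $s0, $t0, $t1  ($s0 = 16, $t0 = 8, $t1 = 9)   */
        instr_t add_i = { .rd = 16, .rs = 8,  .rt = 9,  .writes_reg = true };
        /* sub $t2, $s0, $t3  ($t2 = 10, $s0 = 16, $t3 = 11) */
        instr_t sub_i = { .rd = 10, .rs = 16, .rt = 11, .writes_reg = true };

        /* Hazard detected: the pipeline would hold sub in ID and issue bubbles. */
        return raw_hazard(&add_i, &sub_i) ? 0 : 1;
    }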

Pipelining: Control Hazards
- These hazards result from making a decision while other instructions continue to progress through the pipeline
- Branch instructions are the most common example: we don't know whether to load the next instruction or not
- Three approaches: stall, predict, delay

Pipelining: Control Hazards - Stall
- Do not load the next instruction into the pipeline
[pipeline diagram: after the beq is fetched, fetch stalls for two cycles and two bubbles pass down EX/MEM/WB before the correct next instruction enters the pipeline]
- During decode we know we have a branch
- During execute we know whether the branch is taken or not, and the PC is updated
- Next cycle, fetch the next instruction based on the PC value

Pipelining: Control Hazards - Stall
- Even if you add circuitry to detect the branch and update the PC entirely during decode, you can't avoid a stall

Pipelining: Control Hazards - Predict
- Many algorithms exist
- Simplest: assume the branch will not be taken
  - no penalty if correct
  - stall only when wrong

Pipelining: Control Hazards - Predict (branch not taken)
[pipeline diagrams: when the branch is not taken the prediction is correct and the pipeline keeps flowing; when the branch is taken the prediction is wrong and the wrongly fetched instructions must be flushed]

Pipelining: Control Hazards - Predict
Static branch prediction
- Predict backward branches as taken, forward branches as not taken
- Looping code executes the loop body (say, 100 times) and jumps out of the loop only once, so the backward loop branch is almost always taken
Dynamic branch prediction
- Keep track of recent branch behavior (for each branch)
- Assume recent behavior will continue
- When wrong, clear the history and start over
- Hardware intensive
A sketch of one common dynamic scheme follows below.
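As a sketch of one common dynamic predictor, here is a table of per-branch 2-bit saturating counters in C (a standard textbook mechanism, not necessarily the exact scheme the slide describes). The table size and indexing are assumptions; the processor predicts first, then trains on the actual outcome.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PRED_ENTRIES 1024   /* assumed predictor table size */

    /* 2-bit saturating counters: 0,1 -> predict not taken; 2,3 -> predict taken. */
    static uint8_t counters[PRED_ENTRIES];

    static unsigned pred_index(uint32_t branch_pc)
    {
        return (branch_pc >> 2) % PRED_ENTRIES;   /* drop the byte-offset bits */
    }

    static bool predict_taken(uint32_t branch_pc)
    {
        return counters[pred_index(branch_pc)] >= 2;
    }

    /* Called once the branch outcome is known, to train the predictor. */
    static void update_predictor(uint32_t branch_pc, bool taken)
    {
        uint8_t *c = &counters[pred_index(branch_pc)];
        if (taken && *c < 3)  (*c)++;
        if (!taken && *c > 0) (*c)--;
    }

    int main(void)
    {
        int mispredicts = 0;
        /* A loop branch that is taken 100 times, then not taken once at exit. */
        for (int i = 0; i < 101; i++) {
            bool taken = (i < 100);
            if (predict_taken(0x00400100) != taken)
                mispredicts++;
            update_predictor(0x00400100, taken);
        }
        printf("mispredictions: %d\n", mispredicts);  /* 3: two while warming up, one at exit */
        return 0;
    }

Because the counter must be wrong twice before its prediction flips, the misprediction at loop exit does not also cause a misprediction the next time the loop is entered.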

Pipelining
- Mapping the datapath to a pipeline creates a control hazard and a data hazard
[figure: pipelined datapath with the hazard points marked]

Pipelining: Pipeline Control
[figure: pipelined datapath with control signals]

Architecture: Superscalar
- Parallelism at the micro-architecture level

Introduction: Processor Architecture
[figures]

Architecture: Modern Example
[figures]