Basic Computer Architecture

Basic Computer Architecture. CSCE 496/896: Embedded Systems. Witawas Srisa-an. Review of Computer Architecture. Credit: Most of these slides were made by Prof. Wayne Wolf, the author of the textbook; I made some modifications to the notes for clarity. Some background from CSCE 430 or equivalent is assumed.

von Neumann architecture. Memory holds both data and instructions. The central processing unit (CPU) fetches instructions from memory. The separation of CPU and memory is what distinguishes a programmable computer. CPU registers help out: program counter (PC), instruction register (IR), general-purpose registers, etc. [Diagram: Input Unit and Output Unit connected to a Memory Unit and the CPU (Control + ALU).]

CPU + memory. [Diagram: the PC holds address 200; the CPU sends that address to memory, and memory location 200 returns the instruction ADD r5,r1,r3 over the data bus into the IR.]

Recalling pipelining: what is a potential problem with the von Neumann architecture? Harvard architecture. [Diagram: the CPU (with PC) has separate address/data connections to a data memory and a program memory.]

von Neumann vs. Harvard. Harvard can't use self-modifying code. Harvard allows two simultaneous memory fetches. Most DSPs (e.g., Blackfin from ADI) use the Harvard architecture for streaming data: greater memory bandwidth; different memory bit widths for instructions and data; more predictable bandwidth. Today's processors: Harvard or von Neumann?

RISC vs. CISC Complex instruction set computer (CISC): many addressing modes; many operations. Reduced instruction set computer (RISC): load/store; pipelinable instructions. Instruction set characteristics Fixed vs. variable length. Addressing modes. Number of operands. Types of operands.
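
To make the addressing-mode point concrete, here is a minimal hedged sketch in ARM assembly (the registers and offsets are arbitrary choices for illustration, not taken from the slides):

    MOV  r0, #8           ; immediate operand
    ADD  r1, r0, r2       ; register operands only
    LDR  r3, [r4]         ; register indirect: load from the address in r4
    LDR  r3, [r4, #4]     ; base plus offset: load from r4 + 4
    LDR  r3, [r4, #4]!    ; pre-indexed with writeback: r4 = r4 + 4, then load

A CISC ISA typically offers many more such modes (and memory operands on arithmetic instructions); a RISC load/store ISA keeps the set small so that every instruction fits the pipeline.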

Tensilica Xtensa: RISC-based, with variable-length instructions, but not CISC. Programming model. The programming model is the set of registers visible to the programmer. Some registers are not visible (e.g., the IR).
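
As a hedged, ARM-specific illustration (not from the slides): the program counter is part of the visible programming model, while the instruction register is not.

    MOV  r0, pc           ; the PC (r15) is an architecturally visible register and can be read
    ADD  r1, pc, #8       ; it can even appear as an ALU operand
    ; no instruction can name the IR: it exists only inside the fetch/decode hardware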

Multiple implementations. Successful architectures have several implementations: varying clock speeds; different bus widths; different cache sizes, associativities, and configurations; local memory; etc. Assembly language. One-to-one with instructions (more or less). Basic features: one instruction per line; labels provide names for addresses (usually in the first column); instructions often start in later columns; comments run to the end of the line.

ARM assembly language example:

label1   ADR r4,c
         LDR r0,[r4]      ; a comment
         ADR r4,d
         LDR r1,[r4]
         SUB r0,r0,r1     ; comment (the first operand, r0, is the destination)

Pseudo-ops. Some assembler directives don't correspond directly to instructions: define the current address; reserve storage; define constants.
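
A minimal hedged sketch of such directives, in armasm-style syntax (the symbols c, d, and count are invented for illustration, and directive spellings differ between assemblers):

        AREA  example, DATA   ; directive: start a data section (not an instruction)
count   EQU   16              ; define a constant
c       DCD   100             ; reserve one word of storage and initialize it
d       SPACE 64              ; reserve 64 bytes of uninitialized storage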

Pipelining. Execute several instructions simultaneously, but at different stages. Simple three-stage pipe: fetch, decode, execute, fed from memory. Pipeline complications: we may not always be able to predict the next instruction, e.g., on a conditional branch. This causes a bubble in the pipeline. [Pipeline diagram: a JNZ reaches the execute stage while the following instructions are still in fetch and decode; if the branch is taken they must be discarded, leaving empty slots.]
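
A hedged ARM-flavoured sketch of the branch problem (labels and registers are invented; the exact penalty depends on the pipeline depth): until the conditional branch resolves, the instructions fetched behind it may have to be squashed, leaving a bubble.

loop    SUBS  r2, r2, #1      ; decrement the loop counter and set the condition flags
        BNE   loop            ; taken or not is unknown at fetch time
        MOV   r0, #0          ; already fetched behind the branch; squashed if the branch is taken
        ADD   r1, r1, r3      ; likewise only useful on the fall-through path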

Superscalar. A RISC pipeline executes one instruction per clock cycle (usually). Superscalar machines execute multiple instructions per clock cycle: faster execution, more variability in execution times, a more expensive CPU. Simple superscalar: execute a floating-point and an integer instruction at the same time. They use different registers, and floating-point operations use their own hardware unit. The machine must wait for completion whenever the floating-point and integer units communicate.
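
A hedged ARM/VFP sketch (register choices are arbitrary): the first two instructions use disjoint register files and function units and could issue in the same cycle on a simple dual-issue machine, while the VMOV makes the units communicate and forces a wait.

        ADD      r0, r1, r2   ; integer unit, integer registers
        VADD.F32 s0, s1, s2   ; floating-point unit, FP registers: independent, can issue in parallel
        VMOV     s3, r0       ; moves the integer result to the FP side, so...
        VADD.F32 s4, s4, s3   ; ...this must wait until the integer ADD has completed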

Costs. Good news: parallelism can be found at run time. Bad news: it causes variations in execution time and requires a lot of hardware, roughly n^2 instruction-issue hardware for n-instruction parallelism. Finding parallelism. Independent operations can be performed in parallel:

        ADD r0, r0, r1
        ADD r3, r2, r3
        ADD r6, r4, r0

[Dataflow diagram: the first two ADDs use disjoint registers and can execute in parallel; the third ADD reads r0 and therefore depends on the first.]

Pipeline hazards. Two operations that have a data dependency cannot be executed in parallel:

        x = a + b;
        a = d + e;
        y = a - f;

[Dataflow diagram: the subtraction that produces y reads the a produced by the second addition, so it cannot start until that addition completes.] Order of execution. In-order: the machine stops issuing instructions when the next instruction can't be dispatched. Out-of-order: the machine reorders instructions to keep dispatching; substantially faster but also more complex.
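
The same three statements in a hedged ARM rendering (the register assignment is arbitrary): the SUB has a read-after-write dependence on the second ADD, so that pair cannot execute in parallel, exactly as the dataflow shows.

        ADD  r0, r1, r2       ; x = a + b   (reads the old value of a in r1)
        ADD  r1, r3, r4       ; a = d + e   (writes the new a into r1)
        SUB  r5, r1, r6       ; y = a - f   (must wait for the new value in r1)

Reusing r1 also creates a write-after-read ordering between the first two instructions; out-of-order machines typically remove that kind of hazard with register renaming.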

VLIW architectures. Very long instruction word (VLIW) processing provides significant parallelism and relies on the compiler to identify that parallelism. What is VLIW? Parallel function units sharing a register file. [Diagram: several function units connected to a shared register file, fed by instruction decode and memory.]

VLIW cluster. Function units are organized into clusters to accommodate the available register bandwidth. [Diagram: a row of clusters.] VLIW and compilers. VLIW requires considerably more sophisticated compiler technology than traditional architectures: the compiler must be able to extract parallelism to keep the instruction words full. Many VLIWs have good compiler support.

Scheduling. [Diagram: a set of expressions over operands a through g is scheduled into VLIW instruction slots; slots that cannot be filled are padded with nops.] EPIC. EPIC = explicitly parallel instruction computing. Used in the Intel/HP Merced (IA-64) machine. It incorporates several features that allow the machine to find and exploit increased parallelism.

IA-64 instruction format. Instructions are bundled with a tag that indicates which instructions can be executed in parallel. [Diagram: a 128-bit bundle holding a tag and three instructions.] Memory system. The CPU fetches data and instructions from a memory hierarchy: main memory, L2 cache, L1 cache, CPU.
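
As a hedged sanity check on the 128-bit figure (an encoding detail not on the slides): in the published IA-64 format each bundle holds three 41-bit instruction slots plus a 5-bit template field that plays the role of the tag, and 3 x 41 + 5 = 128 bits.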

Memory hierarchy complications. Program behavior becomes much more state-dependent: it depends on how earlier execution left the cache, so execution time is less predictable. Memory access times can vary by 100x. Example (Intel mobile processors):

                            Pentium 3-M              Pentium 4-M                     Pentium M
Core                        P6 (Tualatin, 0.13µ)     NetBurst (Northwood, 0.13µ)     "P6+" (Banias 0.13µ, Dothan 0.09µ)
L1 cache (data + code)      16KB + 16KB              8KB + 12K µops (trace cache)    32KB + 32KB
L2 cache                    512KB                    512KB                           1024KB
Instruction sets            MMX, SSE                 MMX, SSE, SSE2                  MMX, SSE, SSE2
Max frequency (CPU/FSB)     1.2GHz / 133MHz          2.4GHz / 400MHz (QDR)           2GHz / 400MHz (QDR)
Transistors                 44M                      55M                             77M, 140M
SpeedStep                   2nd generation           2nd generation                  3rd generation

End of Overview Next class: Altera Nios II processors