Internals of a modern processor. Telecom Paristech

1 Internals of a modern processor Telecom Paristech Jean-Marie.Cottin@arm.com 1

2 ARM Overview: founded in November 1990 in Cambridge, UK. First processor designed for the Newton PDA (Apple). Now over 1700 employees (worldwide). In France: Sophia-Antipolis: multicores, secure cores, L2 caches; Grenoble: physical IP (SOI); Paris: sales 2

3 What does ARM do? Intellectual Property (IP): R&D outsourcing for semiconductor companies. ARM technology is low power. Royalty-based business model. Over 15 billion ARM technology-based chips shipped to date; an average of 2 ARM processors per mobile phone 3

4 About me: Telecom Paris (2000); University of Tokyo (東京大学, 2002): formal methods. Worked on compilers for 4 years (Japan). Joined ARM (Sophia Antipolis) in 2006: modelling micro-architectures (cycle-accurate model of the Cortex-A9 in C++/SystemC), benchmarking (performance metrics of CPUs), validation of integer cores (Cortex-A9 and next-generation Cortex-R) and of the L2 cache controller (PL310) 4

5 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 5

6 About pipelines. Why pipelines? [Diagram: inputs (registers) → big block of combinatorial logic → pipe-stage registers → outputs] 6

7 Pipeline (instruction-level parallelism). Early times: the Z80 was not pipelined: several cycles per instruction. Pipeline benefits: split the work into small tasks → frequency up! Parallelize! (ILP: Instruction-Level Parallelism) → 1 cycle per instruction (ideal). 3 kinds of stall: data dependencies; contention for hardware resources; control (e.g. branch target) 7

8 Pipeline examples ARM9 (arch. V5) 5 Stage Pipeline ARM11 (arch. V6) 8 Stage Pipeline Cortex-A8 (arch. V7) 13 Stage Pipeline 8

9 Pipeline of the Cortex-A9 (arch. v7). [Block diagram: instruction prefetch stage (with fast-loop mode) fed by the instruction cache and by branch prediction (Global History Buffer, branch-target address cache, return stack, branch monitor); prediction and instruction queues; dual-instruction decode stage; register rename stage (virtual-to-physical register pool); dispatch and issue stage; out-of-order execution units (ALU/MUL, ALU, FPU/NEON, address generation unit) with out-of-order write-back stage; load-store unit with store buffer and auto-prefetcher; data cache and MMU; memory system on two 64-bit AMBA 3 AXI ports] 9

10 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 10

11 Fetch: speculation & prediction. Speculation: execute instructions in advance (filling the pipe behind a branch) and: commit the result to registers when the branch outcome is confirmed; when speculation turns out to be wrong, flush the pipe and restart fetching at the correct address. A probabilistic optimization: gains on average; a trade-off between performance and power (and complexity). Static prediction: pre-decode immediate branches. Dynamic prediction (Cortex-A9): Branch Target Address Cache (BTAC); probabilistic model: predictors & Global History Buffer (GHB) 11

12 Branch Target Address Cache (BTAC). [Diagram: the current PC is compared against tags Tag0..TagN; a hit returns the target & attributes] The BTAC associates a PC with: a predicate "is a branch"; the target address and state (ARM/Thumb); other information (is a function call, is a function return, etc). Works like a cache (presented later) 12

13 Predictors. For each branch, an FSM estimates the probability of being taken. Simplest model, 1 bit: a loop costs 1 miss at exit + 1 miss at (re)entry. Can do better with only 2 bits: (Strongly | Weakly) × (Taken | Not-Taken); transitions taken ↔ not-taken occur only after 2 misses. [State diagram: SN ↔ WN ↔ WT ↔ ST; each Taken outcome moves one state toward ST, each Not-Taken one state toward SN] 13
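The 2-bit scheme above can be sketched as a saturating counter; a minimal illustrative model (state and function names are mine, not ARM's):

```c
#include <assert.h>

/* 2-bit predictor: states 0..3 = Strongly-Not-Taken, Weakly-Not-Taken,
   Weakly-Taken, Strongly-Taken (a saturating counter). */
typedef enum { SN = 0, WN = 1, WT = 2, ST = 3 } pred_state;

/* Prediction: taken iff the counter is in one of the two "taken" states. */
int predict_taken(pred_state s) { return s >= WT; }

/* Update: move one step toward ST on a taken branch, toward SN otherwise.
   The predicted direction flips only after two consecutive misses. */
pred_state update(pred_state s, int taken)
{
    if (taken) return s == ST ? ST : (pred_state)(s + 1);
    else       return s == SN ? SN : (pred_state)(s - 1);
}
```

For a loop that is taken many times then falls through, this predictor misses only at the exit (dropping ST → WT) and predicts the re-entry correctly, whereas the 1-bit scheme misses both at exit and at re-entry.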

14 Global History Buffer (GHB). Improve prediction further (a heuristic trade-off). Idea: the same branch gets associated with different predictors according to the recent history of control flow; this catches regular patterns in branch history (e.g. nested loops). [Diagram: the PC and a taken/not-taken history pattern from the branch monitor together index the table of predictors] 14

15 Dynamic prediction. [Diagram: the BTAC and GHB turn the current PC into the next (prefetched) PC plus prediction info, which drives the prefetch pipe reading instruction words from memory (I-cache) into the instruction stream] Challenge: the prefetch stage is itself pipelined (3 stages in the Cortex-A9) 15

16 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 16

17 Decode: instruction sets Stage itself is a tree of comparators (no conceptual issue) Relevant issues are in the definition of instruction set: 1. Performance 2. Code density 3. Compiler support 4. Legacy code and market 17

18 Performance. The RISC vs. CISC controversy (historical): CISC (Complex Instruction Set Computer): microcoded instructions on registers and memory (many addressing modes); RISC (Reduced Instruction Set Computer): LD/ST (memory) + simple operations on registers, large register bank. Conditional instructions eliminate jumps. Multiplications (mixed signed/unsigned, 8/16/32/64), MAC and variations, saturated arithmetic. SIMD (Single Instruction Multiple Data): SIMD & Advanced SIMD Extension (ARM's ASE, aka NEON; Intel's MMX & SSE). Applications: video, graphics (3D rendering), signal processing 18

19 Code density. The size of code memory is an issue in embedded systems: performance: does the program fit in the instruction cache? Area on chip: sub-optimal compared to behavioral synthesis. Thumb (ARM7TDMI): 16-bit, but not a fully fledged instruction set; C code: up to 35% smaller. «Code compression» through variable-length instruction sets. Thumb-2: 16/32-bit, full set; almost the same performance as ARM but up to 25% smaller 19

20 Compiler support for SIMD. SIMD: multiple operations on a pair of vectors (series of adjacent values). The gain over scalar execution is potentially huge! Compiler support is limited to loop vectorisation: identify variables with a vector access pattern, and ensure there are no data dependencies between different iterations:

float *a, *b, x; int i, j, n;
...
for (i = 0; i < n; i++) {
    a[j] = x + b[i];   /* b is accessed with a stride of 1 */
    j += 2;            /* a is accessed with a stride of 2 */
}

In practice, SIMD is mostly used through libraries (OpenMAX, OpenGL) 20

21 Legacy code and market. Legacy code: imposes backward compatibility on new architectures; the transition can be eased with (low-performance) emulation modes and virtualization. Architectures tend to get irregular and complicated over time: because of the compatibility constraint in the first place, and because disparate customer requests get included. An instruction set delimits a market: patents on instructions (+hardware) deter competition 21

22 Any questions so far? 22

23 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 23

24 Pipeline of the Cortex-A9 (arch. v7), repeated: this section concerns scheduling (register rename stage, dispatch and issue stage, out-of-order execution and write-back). [Block diagram as on slide 9] 24

25 Scheduling. Static (compiler) or dynamic (processor). Chooses between equivalent sequences of instructions, all respecting data dependencies (the semantics of the program). Example, computing (a × b) + *ptr:

MUL R1, Ra, Rb        LDR R2, [Rptr]
LDR R2, [Rptr]   or   MUL R1, Ra, Rb
ADD R3, R1, R2        ADD R3, R1, R2

Optimize to overlap the latencies of instructions. Latencies come from: operator implementations (MUL, DIV, floating-point): 1 to 10s of cycles; (lack of a) forwarding path: 1 cycle; contention for resources such as register-bank access ports: 1 cycle; memory accesses: 2 to 100s of cycles! 25

26 Data dependencies.

RAW (Read-After-Write), aka «true» data dependency (data transmission):
LDR R2, #Addr
MUL R1, R2, R3
ADD R4, R1, R4
BIC R5, #0xF
ADD R4, R5, R4

WAR (anti-dependency, register reuse):
LDR R1, #Addr
STR R1, [R9]
ADD R1, R2, R4
ADD R4, R1, #3
ADD R4, R1, R4

WAW (output dependency):
Loop: LDR R2, #Addr
MULS R0, R2, R3
MOV R0, #0
ADD R1, R1, R4

WAR & WAW appear because the number of registers is limited: aka name dependencies, removed by the 'renaming' stage in the processor 26

27 Static scheduling: 2 shortcomings. 1) The compiler has only a (very) simplistic model of latencies, e.g. 3 cycles between LDR Rx, [<addr>] and ADD .., Rx, ... 2) Optimization of hardware resources is a difficult problem: pipeline scheduling + register allocation can be stated as an integer linear programming problem (Goodwin & Wilken, 1995): NP-complete! Industry compilers treat scheduling and register allocation as two distinct phases, hence sub-optimal! 27

28 Scheduling vs. register allocation. Calling conventions (ABI: Application Binary Interface): scratch registers (r0, r1, r2, r3) can be left modified by functions; saved registers must be restored at function return. Better scheduling tends to increase register use, incurring «register pressure»: more live intermediate values than scratch registers → saved registers cost a PUSH/POP; more live intermediate values than available registers → register spill: save and restore intermediate results on the stack, very expensive! Register reuse incurs «name dependencies» (WAR & WAW) which artificially limit IPC (Instructions Per Cycle) 28

29 Pipeline of the Cortex-A9 (arch. v7), repeated: dynamic scheduling happens in the register rename stage, the dispatch/issue stage and the out-of-order write-back stage. [Block diagram as on slide 9] 29

30 Dynamic scheduling. Definition: the ability to execute instructions in a different order than the one chosen by the compiler. Aka out-of-order, non-blocking. Especially relevant with caches (non-deterministic delays). In practice, the pipeline is forked: multiple issue of instructions. The issue stage: keeps track of RAW, WAR & WAW dependencies; dispatches an instruction as soon as its operands and resources are available; is critical to processor performance! It benefits from register renaming: redo register allocation onto a wider set of registers than available to the assembler, reducing register pressure by removing WAR & WAW dependencies 30

31 Register renaming: effect. [Diagram: with architectural registers r0..r14, the sequence LDR r3,[r6]; ADD r0,r3,r5; STR r0,[r2]; LDR r0,[r1]; ADD r7,r0,#1 stalls on a cache miss in the first LDR: the second LDR cannot write r0 before the STR has read it (WAR). With renamed registers v0..v55, the two writes to r0 use distinct virtual registers, so the second LDR (a cache hit) and the final ADD execute while the first LDR is still stalled] 31

32 Register renaming: implementation. Keep a table of correspondence architectural ↔ virtual (# virtual registers > # architectural registers). Architectural register reads are remapped to their current virtual counterpart; architectural register writes allocate a new virtual register (renaming).

LDR r3, [r6]      →  LDR v3, [v6]
ADD r0, r3, r5    →  ADD vx, v3, v5   (1: rename r0 → vx)
STR r0, [r2]      →  STR vx, [v2]
LDR r0, [r1]      →  LDR vy, [v1]     (2: rename r0 → vy)
ADD r7, r0, #1    →  ADD v7, vy, #1

32
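The correspondence table above can be sketched as follows; the pool sizes and the naive sequential allocator are illustrative assumptions (a real core recycles virtual registers when instructions retire):

```c
#include <assert.h>

/* Illustrative rename stage: NARCH architectural registers are mapped
   onto a larger pool of NVIRT virtual registers. Sizes are made up. */
#define NARCH 16
#define NVIRT 56

typedef struct {
    int map[NARCH];   /* current virtual register for each arch. register */
    int next_free;    /* naive allocator: hand out virtuals sequentially  */
} rename_table;

void rename_init(rename_table *t)
{
    for (int r = 0; r < NARCH; r++) t->map[r] = r;  /* identity at reset */
    t->next_free = NARCH;
}

/* A source operand reads the current mapping. */
int rename_read(rename_table *t, int arch_reg)
{
    return t->map[arch_reg];
}

/* A destination operand allocates a fresh virtual register, which is
   what removes WAR/WAW dependencies on the architectural name. */
int rename_write(rename_table *t, int arch_reg)
{
    int v = t->next_free++;
    assert(v < NVIRT);
    t->map[arch_reg] = v;
    return v;
}
```

Replaying the slide's sequence, the two writes to r0 receive distinct virtual registers, so the second LDR no longer has to wait for the STR to read the old r0.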

33 Any questions on scheduling? 33

34 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 34

35 Pipeline of the Cortex-A9 (arch. v7), repeated: this section concerns the MMU and the memory system (data cache, load-store unit, AMBA 3 AXI interfaces). [Block diagram as on slide 9] 35

36 Memory Management Unit. [Diagram: each process (p0, p1, ...) sees the same 4GB virtual address space (0 to 2^32); the MMU maps it onto the distinct regions of physical memory allocated to each process] Managed by the OS: the OS provides every process with the same virtual address space; for each process, the same virtual address maps to different physical addresses; the OS just changes the virtual-to-physical mapping before running a process (page-table address register in the MMU) 36

37 Memory Management Unit. The address seen by a process is a virtual address. [Diagram: virtual address = virtual page index (bits 31..n) ++ page offset (bits n-1..0); the page table translates the virtual page index into a physical page index, which is concatenated with the unchanged page offset to form the physical address in main memory] The page-table lookup is done by the MMU; the page table itself is located in main memory. Pages have attributes: access permission, cacheability. It is common to have 2 or 3 levels of translation (Linux: 4K pages) 37
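The translation above can be sketched as a single-level walk; this is a deliberately simplified model assuming one level and 4K pages (real ARM MMUs use 2-3 levels, TLBs, permission and cacheability checks):

```c
#include <stdint.h>

/* Illustrative single-level translation for 4K pages, 32-bit addresses. */
#define PAGE_SHIFT 12                   /* 4K pages: offset = bits 11..0 */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* page_table[virtual page index] = physical page index (assuming all
   entries used here are valid; a real walk checks a valid bit). */
uint32_t translate(const uint32_t *page_table, uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;      /* virtual page index */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);  /* page offset        */
    uint32_t ppn    = page_table[vpn];          /* lookup in memory   */
    return (ppn << PAGE_SHIFT) | offset;        /* physical address   */
}
```

The offset bits pass through unchanged; only the page index is remapped, which is exactly why the OS can swap the whole mapping per process by changing one page-table base register.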

38 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 38

39 Introduction. A cache is a copy of a limited amount of main memory in a smaller but faster local memory (a proxy). Purpose: speed up accesses! Hardware support: copies data from main to cache memory on demand; ensures the master is updated whenever the local copy is modified; makes it mostly transparent to software! The optimization relies on 2 assumptions of locality. In space: accesses are often to the same or contiguous addresses, e.g. data types like structures and arrays (counter-examples: linked lists, binary trees). In time: recently accessed global data is likely to be accessed again. This notion of locality depends on the size of the cache 39
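The spatial-locality assumption and its counter-example can be contrasted in code; an illustrative sketch (both loops compute the same sum, but the array walk touches contiguous addresses while the list walk chases pointers that may land anywhere in memory):

```c
#include <stddef.h>

/* Stride-1 array traversal: each cache-line fill serves several
   subsequent elements (spatial locality). */
int sum_array(const int *a, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

struct node { int value; struct node *next; };

/* Pointer chasing: every hop may be a cache miss, and the next
   address is not known until the current node has been loaded. */
int sum_list(const struct node *p)
{
    int s = 0;
    for (; p; p = p->next) s += p->value;
    return s;
}
```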

40 Memory hierarchy. Caches of level 1, 2 or 3, integrated or external. D$ and I$ (Harvard) vs. unified (von Neumann). Latency can vary from 1 to 100s of cycles: non-deterministic! [Diagram, fastest to slowest: core → L1 cache (32K) → L2 cache (256K) on chip; main memory (512M) and mass storage (10G, managed by the OS) off chip] 40

41 Implementation. The data address (physical or virtual) is split into: tag (bits 31..L+S), index (bits L+S-1..L) and offset in line (bits L-1..0). [Diagram: the index selects one of the 2^S sets; within the set, the tag is compared against the p stored tags in parallel; on a hit, cache line[offset] is returned] p-associative: p lookups in parallel, p = 4..16. Typical line size: 4 to 16 words (128 to 512 bits). Each cache line has a valid bit; a line is dirty when its data has been modified 41
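The address split and lookup above can be sketched as follows; the geometry (4-way, 64 sets, 32-byte lines) is an illustrative assumption, and the hardware's p parallel comparisons become a loop here:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative p-way set-associative cache: 2^S sets, 2^L-byte lines. */
#define L 5                    /* line size: 2^5 = 32 bytes */
#define S 6                    /* 2^6 = 64 sets             */
#define P 4                    /* 4-way associative         */

typedef struct {
    uint32_t tag;
    int      valid;
    int      dirty;
    uint8_t  data[1 << L];
} cache_line;

cache_line cache[1 << S][P];   /* zero-initialized: all lines invalid */

/* Split the address, then probe the P ways of the selected set
   (in parallel in hardware). Returns the matching line, NULL on miss. */
cache_line *lookup(uint32_t addr)
{
    uint32_t offset = addr & ((1u << L) - 1);        /* bits L-1..0   */
    uint32_t index  = (addr >> L) & ((1u << S) - 1); /* bits L+S-1..L */
    uint32_t tag    = addr >> (L + S);               /* bits 31..L+S  */
    (void)offset;  /* on a hit, the data access would use data[offset] */
    for (int way = 0; way < P; way++) {
        cache_line *line = &cache[index][way];
        if (line->valid && line->tag == tag)
            return line;
    }
    return NULL;                                     /* miss */
}
```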

42 Maintenance. Fine, but incoherencies occur when: code modifies itself (with Harvard caches): the new instruction is still in the data cache while the instruction cache (and memory) hold the old one! The OS switches context (for virtually addressed caches): virtual addresses get remapped to a different memory range. The cacheability of memory regions is changed (MPU or MMU). Several cores access the same global variable. Software must then initiate cache maintenance operations, either: invalidate (valid := 0) or clean-and-invalidate (write back to main memory & valid := 0); on the whole cache, by virtual address, or by physical address. Takes time! 42

43 Any questions on caches? 43

44 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 44

45 Multicore CPU (thread-level parallelism). Single-threaded execution is reaching physical limits: frequency → power → heat dissipation limits (PCs). Studies show multicore is more energy efficient than single-core. Thread-level parallelism (TLP), as opposed to ILP. The industry is now moving to multi-CPU systems. Software must manage coherency among caches whenever a global variable is shared among several cores. Problems: software complexity (debug is a pain!) and performance loss (some 30% of time spent in cache maintenance). Multicore CPUs: hardware support for cache coherency; see the MESI/MOESI protocols 45
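The MESI protocol mentioned above can be sketched as a per-line state machine; this is a simplified, illustrative model that ignores the bus transactions and write-backs accompanying each transition (and the extra Owned state of MOESI):

```c
/* MESI states for one cache line in one core's cache. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

typedef enum {
    LOCAL_READ, LOCAL_WRITE,   /* this core's accesses             */
    SNOOP_READ, SNOOP_WRITE    /* another core's accesses, snooped */
} mesi_event;

mesi_state mesi_next(mesi_state s, mesi_event e, int other_sharers)
{
    switch (e) {
    case LOCAL_READ:   /* miss fills Exclusive if no other cache has it */
        return s == INVALID ? (other_sharers ? SHARED : EXCLUSIVE) : s;
    case LOCAL_WRITE:  /* writing ends in Modified (others invalidated) */
        return MODIFIED;
    case SNOOP_READ:   /* another core reads: degrade our copy to Shared
                          (a Modified line is written back first) */
        return s == INVALID ? INVALID : SHARED;
    case SNOOP_WRITE:  /* another core writes: our copy becomes stale */
        return INVALID;
    }
    return s;
}
```

In the Cortex-A9 MPCore, the Snoop Control Unit generates exactly these snoop events between the per-core data caches, so software no longer needs explicit maintenance operations for data shared between the cores.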

46 Cache coherency (Cortex-A9 MPCore). Configurable: between 1 and 4 CPUs, each with an optional SIMD engine or floating-point unit. [Block diagram: up to four Cortex-A9 CPUs (each with FPU/NEON, trace, instruction cache and data cache) connected to a Snoop Control Unit (SCU) providing snoop filtering and cache-to-cache transfers; Generalized Interrupt Control and Distribution (GIC); timers; Accelerator Coherence Port (ACP); advanced bus interface unit; L2 cache controller (PL310)] 46

47 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 47

48 Summary & Conclusion. At the pipeline level, instruction throughput is improved by: good predictive models for branch speculation; dynamic scheduling (which renaming further improves). Caches reduce latency but incur non-determinism. Multicore extensions support cache coherency for faster inter-process communication. Upcoming challenges in processor design: even more optimizations to hide latencies (e.g. speculative data prefetch); the memory system plays a critical role in overall performance (not just the CPU alone)! SoC design complexity is paramount (integration of 10s of IPs). Power consumption is the key consideration! 48

49 Questions? Nintendo DSi (ARM7 & ARM9) Microsoft Zune HD (ARM11 MPcore) Lenovo Skylight (Qualcomm ARM SnapDragon) Nokia N900 (Cortex-A8) 49


More information

Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan

Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan Processors Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan chanhl@maili.cgu.edu.twcgu General-purpose p processor Control unit Controllerr Control/ status Datapath ALU

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

EECC551 Exam Review 4 questions out of 6 questions

EECC551 Exam Review 4 questions out of 6 questions EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Copyright 2016 Xilinx

Copyright 2016 Xilinx Zynq Architecture Zynq Vivado 2015.4 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Identify the basic building

More information

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model. Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution

More information

Lecture 19: Instruction Level Parallelism

Lecture 19: Instruction Level Parallelism Lecture 19: Instruction Level Parallelism Administrative: Homework #5 due Homework #6 handed out today Last Time: DRAM organization and implementation Today Static and Dynamic ILP Instruction windows Register

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

Processors. Young W. Lim. May 12, 2016

Processors. Young W. Lim. May 12, 2016 Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

Instruction-Level Parallelism and Its Exploitation

Instruction-Level Parallelism and Its Exploitation Chapter 2 Instruction-Level Parallelism and Its Exploitation 1 Overview Instruction level parallelism Dynamic Scheduling Techniques es Scoreboarding Tomasulo s s Algorithm Reducing Branch Cost with Dynamic

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise

More information

Dynamic Scheduling. CSE471 Susan Eggers 1

Dynamic Scheduling. CSE471 Susan Eggers 1 Dynamic Scheduling Why go out of style? expensive hardware for the time (actually, still is, relatively) register files grew so less register pressure early RISCs had lower CPIs Why come back? higher chip

More information

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB)

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

Static & Dynamic Instruction Scheduling

Static & Dynamic Instruction Scheduling CS3014: Concurrent Systems Static & Dynamic Instruction Scheduling Slides originally developed by Drew Hilton, Amir Roth, Milo Martin and Joe Devietti at University of Pennsylvania 1 Instruction Scheduling

More information

Superscalar Organization

Superscalar Organization Superscalar Organization Nima Honarmand Instruction-Level Parallelism (ILP) Recall: Parallelism is the number of independent tasks available ILP is a measure of inter-dependencies between insns. Average

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level Parallelism (ILP) &

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions) EE457 Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm) By Gandhi Puvvada References EE557 Textbook Prof Dubois EE557 Classnotes Prof Annavaram s

More information

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics Computer and Hardware Architecture I Benny Thörnberg Associate Professor in Electronics Hardware architecture Computer architecture The functionality of a modern computer is so complex that no human can

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency Superscalar Processors Ch 13 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction 1 New dependency for superscalar case? (8) Name dependency (nimiriippuvuus) two use the same

More information

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012 CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations

More information

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

Out of Order Processing

Out of Order Processing Out of Order Processing Manu Awasthi July 3 rd 2018 Computer Architecture Summer School 2018 Slide deck acknowledgements : Rajeev Balasubramonian (University of Utah), Computer Architecture: A Quantitative

More information

CS 351 Final Exam Solutions

CS 351 Final Exam Solutions CS 351 Final Exam Solutions Notes: You must explain your answers to receive partial credit. You will lose points for incorrect extraneous information, even if the answer is otherwise correct. Question

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Precise Exceptions and Out-of-Order Execution. Samira Khan

Precise Exceptions and Out-of-Order Execution. Samira Khan Precise Exceptions and Out-of-Order Execution Samira Khan Multi-Cycle Execution Not all instructions take the same amount of time for execution Idea: Have multiple different functional units that take

More information

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue

More information

RA3 - Cortex-A15 implementation

RA3 - Cortex-A15 implementation Formation Cortex-A15 implementation: This course covers Cortex-A15 high-end ARM CPU - Processeurs ARM: ARM Cores RA3 - Cortex-A15 implementation This course covers Cortex-A15 high-end ARM CPU OBJECTIVES

More information

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , ) Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14) 1 1-Bit Prediction For each branch, keep track of what happened last time and use

More information

Hardware Speculation Support

Hardware Speculation Support Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

ELC4438: Embedded System Design Embedded Processor

ELC4438: Embedded System Design Embedded Processor ELC4438: Embedded System Design Embedded Processor Liang Dong Electrical and Computer Engineering Baylor University 1. Processor Architecture General PC Von Neumann Architecture a.k.a. Princeton Architecture

More information

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1 Superscalar Processing 740 October 31, 2012 Evolution of Intel Processor Pipelines 486, Pentium, Pentium Pro Superscalar Processor Design Speculative Execution Register Renaming Branch Prediction Architectural

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

Big.LITTLE Processing with ARM Cortex -A15 & Cortex-A7

Big.LITTLE Processing with ARM Cortex -A15 & Cortex-A7 Big.LITTLE Processing with ARM Cortex -A15 & Cortex-A7 Improving Energy Efficiency in High-Performance Mobile Platforms Peter Greenhalgh, ARM September 2011 This paper presents the rationale and design

More information

The ARM Cortex-A9 Processors

The ARM Cortex-A9 Processors The ARM Cortex-A9 Processors This whitepaper describes the details of the latest high performance processor design within the common ARM Cortex applications profile ARM Cortex-A9 MPCore processor: A multicore

More information

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

More information