Internals of a modern processor. Telecom Paristech
1 Internals of a modern processor Telecom Paristech Jean-Marie.Cottin@arm.com 1
2 ARM Overview Founded in November 1990 in Cambridge, UK. First processor designed for the Newton PDA (Apple). Now over 1700 employees (worldwide). In France: Sophia-Antipolis (multicores, secure cores, L2 caches), Grenoble (physical IP, SOI), Paris (sales) 2
3 What does ARM do? Intellectual Property (IP): R&D outsourcing for semiconductor companies ARM technology is low power Royalty based business model Over 15 billion ARM technology based chips shipped to date Average of 2 ARM processors per mobile phone 3
4 About me Telecom Paris (2000). University of Tokyo (東京大学, 2002): formal methods. Worked on compilers for 4 years (Japan). Joined ARM (Sophia Antipolis) in 2006: modelling micro-architectures (cycle-accurate model of the Cortex-A9 in C++/SystemC), benchmarking (performance metrics of CPUs), validation of integer cores (Cortex-A9 and next-generation Cortex-R) and of the L2 cache controller (PL310) 4
5 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 5
6 About pipelines Why pipelines? [Figure: a single pipe stage: input registers feed a large block of combinational logic whose outputs are registered for the next stage] 6
7 Pipeline (instruction level parallelism) Early times: the Z80 was not pipelined: several cycles per instruction. Pipeline benefits: split the work into small tasks (frequency goes up!) and parallelize them (ILP: Instruction Level Parallelism), ideally reaching 1 cycle per instruction. 3 kinds of stall: data dependencies, contention for hardware resources, control (e.g. branch target) 7
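As a back-of-the-envelope sketch (my own model, not from the slides): once full, a k-stage pipeline retires one instruction per cycle, so n instructions take about k + (n - 1) cycles, versus n * k cycles if each instruction occupies the whole datapath. The three kinds of stall above are what push real machines away from this ideal.

```c
/* Idealized pipeline timing model: no stalls, one instruction issued per
   cycle once the pipe is full. Illustrative only; real pipelines stall on
   data dependencies, resource contention and control. */
static unsigned long pipelined_cycles(unsigned k, unsigned long n) {
    return n == 0 ? 0 : k + (n - 1);   /* fill time + 1 cycle per instr */
}

static unsigned long unpipelined_cycles(unsigned k, unsigned long n) {
    return (unsigned long)k * n;       /* each instr traverses all k stages */
}
```

For k = 5 and n = 1000 this gives 1004 vs. 5000 cycles, i.e. the speedup approaches the stage count k as n grows.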
8 Pipeline examples ARM9 (arch. V5) 5 Stage Pipeline ARM11 (arch. V6) 8 Stage Pipeline Cortex-A8 (arch. V7) 13 Stage Pipeline 8
9 Pipeline of the Cortex-A9 (arch. V7) [Figure: block diagram. Fetch: instruction prefetch stage with a fast-loop instruction cache, branch prediction (Global History Buffer, branch target address cache, return stack) and a branch monitor. Decode: prediction and instruction queues with dispatch, feeding a dual-instruction decode stage. Scheduling: register rename stage (virtual-to-physical register pool) and issue stage. Execution: ALU/MUL, ALU, FPU/NEON, and an address generation unit (AGU) feeding a load-store unit with store buffer and auto-prefetcher; quad-slot out-of-order write-back stage with forwarding. Memory system: instruction cache, data cache and MMU, with two 64-bit AMBA 3 AXI interfaces] 9
10 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 10
11 Fetch: speculation & prediction Speculation: execute instructions in advance (filling the pipe behind a branch) and: commit the results to registers once the control flow is confirmed; whenever the speculation turns out to be wrong, flush the pipe and restart fetching at the new address. A probabilistic optimization: it gains on average. Trade-off between performance and power (and complexity). Static prediction: pre-decode immediate branches. Dynamic prediction (Cortex-A9): Branch Target Address Cache (BTAC); probabilistic model: predictors & Global History Buffer (GHB) 11
12 Branch Target Address Cache (BTAC) [Figure: the current PC is compared against tags Tag0..Tagn; a hit returns the target & attributes] The BTAC associates a PC with: a predicate ("is a branch"), a target address and state (ARM, Thumb), and other information (is a function call, is a function return, etc.). It works like a cache (presented later) 12
13 Predictors For each branch, an FSM estimates the probability of being taken. Simplest model: 1 bit per branch; a loop then costs 1 miss at exit + 1 miss at (re)entry. Can do better with only 2 bits: (Strongly | Weakly) x (Taken | Not-Taken); transitions taken <-> not-taken occur only after 2 misses. [Figure: 4-state FSM SN <-> WN <-> WT <-> ST, moving toward Strongly Taken on each Taken outcome and toward Strongly Not-Taken on each Not-Taken outcome] 13
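The 2-bit scheme above can be sketched in a few lines of C (the encoding and function names are mine): states 0..3 run from strongly not-taken to strongly taken, and the prediction only flips after two consecutive mispredictions.

```c
/* 2-bit saturating counter: 0 = strongly not-taken ... 3 = strongly taken.
   The prediction flips only after two consecutive mispredictions, so a loop
   branch stays predicted taken across the single not-taken exit. */
typedef unsigned char pred2_t;          /* holds 0..3 */

static int pred2_predict(pred2_t s) {   /* 1 = predict taken */
    return s >= 2;
}

static pred2_t pred2_update(pred2_t s, int taken) {
    if (taken) return s < 3 ? s + 1 : 3;
    else       return s > 0 ? s - 1 : 0;
}
```

Starting from strongly taken, one not-taken outcome (a loop exit) still leaves the branch predicted taken on re-entry, which is exactly the improvement over the 1-bit model.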
14 Global History Buffer (GHB) Improve prediction further! A heuristic trade-off. Idea: the same branch gets associated with different predictors according to the history of control flow, catching regular patterns in branch history (e.g. nested loops). [Figure: the PC and a shift register of recent Taken/Not-Taken outcomes together select a predictor entry; the Branch Monitor feeds outcomes back] 14
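The history idea can be sketched as a gshare-style predictor in C: the branch PC is hashed with a global history register, so the same branch maps to different 2-bit counters along different control paths. The table size and the XOR hash are illustrative assumptions of mine, not the actual Cortex-A9 GHB organization.

```c
#include <stdint.h>

/* gshare-style sketch: index a table of 2-bit counters by PC XOR global
   history, so one static branch uses different counters per history. */
#define GHB_BITS 10
#define GHB_SIZE (1u << GHB_BITS)

static unsigned char table[GHB_SIZE];   /* 2-bit counters, 0..3 */
static uint32_t history;                /* shift register of outcomes */

static unsigned ghb_index(uint32_t pc) {
    return (pc ^ history) & (GHB_SIZE - 1);
}

static int ghb_predict(uint32_t pc) {   /* 1 = predict taken */
    return table[ghb_index(pc)] >= 2;
}

static void ghb_update(uint32_t pc, int taken) {
    unsigned i = ghb_index(pc);
    if (taken) { if (table[i] < 3) table[i]++; }
    else       { if (table[i] > 0) table[i]--; }
    history = ((history << 1) | (taken & 1)) & (GHB_SIZE - 1);
}
```

After a branch has been trained along a stable history pattern, the counter reached through that pattern saturates and the prediction follows it.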
15 Dynamic prediction [Figure: the prefetch pipe sends the next PC to memory (I-cache) and, in parallel, to the BTAC and GHB; prefetched PCs carry prediction info, and instruction words form the instruction stream] Challenge: the prefetch stage itself is pipelined (3 stages in the Cortex-A9) 15
16 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 16
17 Decode: instruction sets Stage itself is a tree of comparators (no conceptual issue) Relevant issues are in the definition of instruction set: 1. Performance 2. Code density 3. Compiler support 4. Legacy code and market 17
18 Performance The RISC vs. CISC controversy (historical): CISC (Complex Instruction Set Computer): microcoded instructions on registers and memory (many addressing modes). RISC (Reduced Instruction Set Computer): LD/ST (memory) + simple operations on registers, large register bank. Conditional instructions eliminate jumps. Multiplications (mixed signed/unsigned, 8/16/32/64), MAC and variations, saturated arithmetic. SIMD (Single Instruction Multiple Data): SIMD & Advanced SIMD Extension (ARM's ASE, aka NEON; Intel's MMX & SSE). Applications: video, graphics (3D rendering), signal processing 18
19 Code density The size of code memory is an issue in embedded systems: Performance: does the program fit in the instruction cache? Area on chip: sub-optimal compared to behavioral synthesis. Thumb (ARM7TDMI): 16-bit, but not a fully-fledged instruction set; C code up to 35% smaller. «Code compression» through variable-length instruction sets. Thumb-2: 16/32-bit, full instruction set; almost the same performance as ARM but up to 25% smaller 19
20 Compiler support for SIMD SIMD: multiple operations on a pair of vectors (series of adjacent values). Gain over scalar execution is potentially huge! Compiler support is limited to loop vectorisation: identify variables with a vector access pattern, and ensure there are no data dependencies between different iterations.
float *a, *b, x; int i, j, n; ...
for (i = 0; i < n; i++) {
    *(a+j) = x + b[i];   // b is accessed with a stride of 1
    j += 2;              // a is accessed with a stride of 2
}
Also used through libraries (OpenMax, OpenGL) 20
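For contrast with the stride-2 loop above, here is a sketch of a loop that compilers vectorize readily: unit stride on both arrays, and C99 `restrict` to promise there is no aliasing between iterations (the function name and signature are mine, not from the slides).

```c
/* A vectorization-friendly loop: the compiler can only use SIMD when it can
   prove a and b do not alias; 'restrict' states that promise explicitly,
   and unit-stride access keeps the loads contiguous for vector registers. */
void add_scalar(int n, float x, const float *restrict b, float *restrict a) {
    for (int i = 0; i < n; i++)
        a[i] = x + b[i];        /* independent iterations, stride 1 */
}
```

Without `restrict`, a conservative compiler must assume `a` might overlap `b` and fall back to scalar code or emit a runtime overlap check.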
21 Legacy code and market Legacy code imposes backward compatibility on new architectures; the transition can be eased with a (low-performance) emulation mode and virtualization. Architectures tend to get irregular and complicated over time: because of the compatibility constraint in the first place, and because disparate customer requests get included. An instruction set delimits a market: patents on instructions (+ hardware) deter competition 21
22 22 Any question so far?
23 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 23
24 Pipeline of the Cortex-A9 (arch. V7) [Figure: the Cortex-A9 pipeline block diagram again (see slide 9), with the scheduling stages highlighted: register rename stage (virtual-to-physical register pool), issue stage, out-of-order execution units and write-back] 24
25 Scheduling Static (compiler) or dynamic (processor). Chooses between equivalent sequences of instructions, all respecting data dependencies (the semantics of the program). E.g. for *ptr + a*b:
MUL R1, Ra, Rb; LDR R2, [Rptr]; ADD R3, R1, R2
or
LDR R2, [Rptr]; MUL R1, Ra, Rb; ADD R3, R1, R2
Optimize to overlap the latencies of instructions. Latencies come from: operator implementations (MUL, DIV, floating point): 1 to 10s of cycles; (lack of) forwarding path: 1 cycle; contention for resources like access ports to the register bank: 1 cycle; memory accesses: 2 to 100s of cycles! 25
26 Data dependencies RAW (Read-After-Write), aka «true» data dependency (data transmission):
LDR R2, #Addr
MUL R1, R2, R3
ADD R4, R1, R4
BIC R5, #0xF
ADD R4, R5, R4
WAR: anti-dependency (register reuse):
LDR R1, #Addr
STR R1, [R9]
ADD R1, R2, R4   ; writes R1 after the STR reads it
WAW: output dependency:
ADD R4, R1, #3
ADD R4, R1, R4
Loop: LDR R2, #Addr
MULS R0, R2, R3
MOV R0, #0       ; overwrites the MULS result
ADD R1, R1, R4
WAR & WAW appear because the number of registers is limited: aka name dependencies, removed by the 'renaming' stage in the processor 26
27 Static scheduling: 2 shortcomings The compiler has only a (very) simplistic model of latencies, e.g. 3 cycles between LDR Rx, [<addr>] and a dependent ADD .., Rx, .. Optimization of hardware resources is a difficult problem: pipeline scheduling + register allocation can be stated as an integer linear programming (ILP) problem (Goodwin & Wilken, 1995): NP-complete! Industry compilers treat scheduling and register allocation as two distinct phases, hence sub-optimal! 27
28 Scheduling vs. register allocation Calling conventions (ABI: Application Binary Interface): scratch registers (r0, r1, r2, r3) can be left modified by functions; saved registers must be restored at function return. Better scheduling tends to increase register use, incurring «register pressure»: more live intermediate values than scratch registers costs a PUSH/POP of saved registers; more live intermediate values than available registers causes register spill: save and restore intermediate results on the stack. Very expensive! Register reuse incurs «name dependencies» (WAR & WAW) which artificially limit IPC (Instructions Per Cycle) 28
29 Pipeline of the Cortex-A9 (arch. V7) [Figure: the Cortex-A9 pipeline block diagram again (see slide 9), with the rename and issue stages highlighted ahead of the dynamic-scheduling discussion] 29
30 Dynamic scheduling Definition: the ability to execute instructions in a different order than the one chosen by the compiler. Aka out-of-order, non-blocking. Especially relevant with caches (non-deterministic delays). In practice, the pipeline is forked: multiple issue of instructions. The issue stage: keeps track of RAW, WAR & WAW dependencies; dispatches an instruction as soon as its operands and resources are available; is critical to processor performance! It benefits from register renaming: redo register allocation onto a wider set of registers than available in assembler; reduce register pressure: remove WAR & WAW dependencies 30
31 Register renaming: effect [Figure: architectural registers r0..r14 vs. renamed registers v0..v55. Without renaming, the sequence LDR r3, [r6]; ADD r0, r3, r5; STR r0, [r2]; LDR r0, [r1]; ADD r7, r0, #1 stalls behind a cache miss because of the WAR dependency on r0. With renaming, the two writes to r0 get distinct virtual registers (v0 and v1), so the later LDR hits in the cache and ADD r7, v1, #1 executes while the earlier miss is still pending] 31
32 Register renaming: implementation Keep a table of correspondence architectural <-> virtual; # virtual registers > # architectural registers. Architectural register reads are remapped to their current virtual counterpart; architectural register writes allocate a new virtual register (renaming). Example:
LDR r3, [r6]    -> LDR v3, [v6]
ADD r0, r3, r5  -> ADD vx, v3, v5   (1: rename R0 -> Vx)
STR r0, [r2]    -> STR vx, [v2]
LDR r0, [r1]    -> LDR vy, [v1]     (2: rename R0 -> Vy)
ADD r7, r0, #1  -> ADD v7, vy, #1
32
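The correspondence table above can be sketched in C. The pool sizes are illustrative, and the free-list handling is simplified to a bump allocator: real hardware recycles virtual registers when instructions retire rather than running out.

```c
#include <assert.h>

/* Rename-table sketch: reads map an architectural register to its current
   virtual register; each write allocates a fresh virtual register. */
#define NARCH 16
#define NVIRT 56

static int map[NARCH];          /* arch -> current virtual */
static int next_virt;

static void rename_init(void) {
    for (int r = 0; r < NARCH; r++) map[r] = r;   /* v0..v15 initially */
    next_virt = NARCH;
}

static int rename_read(int arch) { return map[arch]; }

static int rename_write(int arch) {
    assert(next_virt < NVIRT);  /* simplification: real HW recycles/stalls */
    map[arch] = next_virt++;
    return map[arch];
}
```

Running the slide's sequence through this table, the two writes to r0 get two distinct virtual registers, which is precisely what removes the WAR dependency.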
33 33 Any question on scheduling?
34 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 34
35 Pipeline of the Cortex-A9 (arch. V7) [Figure: the Cortex-A9 pipeline block diagram again (see slide 9), with the memory system highlighted: MMU, instruction and data caches, and the two 64-bit AMBA 3 AXI interfaces] 35
36 Memory Management Unit [Figure: two 4GB (2^32) virtual address spaces, one per process (p0, p1), both mapped through the MMU onto disjoint regions of physical memory, alongside memory of other processes] Managed by the OS. The OS needs to provide every process with the same virtual address space: for each process, the same address maps to different physical addresses. The OS just changes the virtual-to-physical mapping before running a process (page table address register in the MMU) 36
37 Memory Management Unit The address seen by a process is a virtual address: bits [31:n] are the virtual page index, bits [n-1:0] the page offset. The page table maps the virtual page index to a physical page index; the page offset is carried over unchanged. The page table lookup is done by the MMU; the page table itself is located in main memory. Pages have attributes: access permission, cacheability. It is common to have 2 or 3 levels of translation (Linux: 4K pages) 37
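A sketch of a 2-level lookup for 4K pages on a 32-bit virtual address: bits [31:22] index the first-level table, bits [21:12] the second level, bits [11:0] are the page offset. The table entry format is a simplification of mine (plain page numbers); real ARM descriptors also encode permissions and cacheability.

```c
#include <stdint.h>

/* Simplified 2-level page table walk: l1[i] holds the index of a 2nd-level
   table, whose entries hold physical page numbers (no attribute bits). */
#define L1_SIZE 1024
#define L2_SIZE 1024

uint32_t translate(const uint32_t l1[], uint32_t l2_tables[][L2_SIZE],
                   uint32_t va) {
    uint32_t l1_idx = va >> 22;             /* bits 31:22 */
    uint32_t l2_idx = (va >> 12) & 0x3FF;   /* bits 21:12 */
    uint32_t ppn    = l2_tables[l1[l1_idx]][l2_idx];
    return (ppn << 12) | (va & 0xFFF);      /* bits 11:0 = page offset */
}
```

In hardware this walk is performed by the MMU (and cached in a TLB, which the slides do not cover); the OS only rewrites the tables and the page table address register.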
38 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 38
39 Introduction A cache is a copy of a limited amount of main memory in a smaller but faster local memory (a proxy). Purpose: speed up accesses! Hardware support: copies data from main to cache memory on demand; ensures the master copy is updated whenever the local copy is modified; makes it mostly transparent to software! The optimization relies on 2 assumptions of locality: In space: accesses are often to the same or contiguous addresses, e.g. data types like structures and arrays (counter-examples: linked lists, binary trees). In time: recently accessed global data is likely to be accessed again. This notion of locality depends on the size of the cache 39
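The spatial-locality assumption can be made measurable with a toy direct-mapped cache model that just counts misses for a stream of addresses. The sizes here are illustrative choices of mine, not those of any real ARM cache.

```c
#include <stdint.h>
#include <string.h>

/* Toy direct-mapped cache: 64 lines of 32 bytes. Counts misses for a
   sequence of byte addresses, to compare access patterns. */
#define LINES 64
#define LINE_BYTES 32

unsigned count_misses(const uint32_t *addrs, unsigned n) {
    uint32_t tags[LINES];
    uint8_t  valid[LINES];
    memset(valid, 0, sizeof valid);
    unsigned misses = 0;
    for (unsigned i = 0; i < n; i++) {
        uint32_t line = addrs[i] / LINE_BYTES;   /* which memory line */
        uint32_t set  = line % LINES;            /* direct-mapped index */
        if (!valid[set] || tags[set] != line) {
            misses++;                            /* fetch the whole line */
            valid[set] = 1;
            tags[set]  = line;
        }
    }
    return misses;
}
```

Walking 64 contiguous 4-byte words touches only 8 lines (8 misses, 56 hits), while 64 accesses strided by the line size miss every single time: the same work, an 8x difference in traffic.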
40 Memory hierarchy Caches of level 1, 2 or 3; integrated or external. D$ and I$ (Harvard) vs. unified (von Neumann). Latency can vary from 1 to 100s of cycles: non-deterministic! [Figure: core -> L1 cache (32K) -> L2 cache (256K) on chip; main memory (512M) and mass storage (10G, managed by the OS) off chip; speed decreases down the hierarchy] 40
41 Implementation The data address (physical or virtual) is split as: tag (bits [31:L+S]), index (bits [L+S-1:L]), offset in line (bits [L-1:0]). 2^S sets: a set is selected by the index bits of the address. p-associative: p tag comparisons in parallel (p = 4..16); a hit selects cache line[offset] among data0..data2^L. Typical line size: 4 to 16 words (128 to 512 bits). Each cache line has a valid bit; a line is dirty when its data has been modified 41
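The tag/index/offset split described above, as a small sketch in C with the field widths as parameters: for example 32-byte lines give L = 5, and 256 sets give S = 8.

```c
#include <stdint.h>

/* Split a 32-bit address for a cache with 2^s sets and 2^l-byte lines:
   offset = low l bits, index = next s bits, tag = remaining high bits. */
typedef struct { uint32_t tag, index, offset; } cache_addr_t;

cache_addr_t split_addr(uint32_t addr, unsigned l, unsigned s) {
    cache_addr_t a;
    a.offset = addr & ((1u << l) - 1);          /* byte within the line */
    a.index  = (addr >> l) & ((1u << s) - 1);   /* selects the set */
    a.tag    = addr >> (l + s);                 /* compared in parallel */
    return a;
}
```

A 32KB 4-way cache with 32-byte lines has 32768 / (4 * 32) = 256 sets, so L = 5 and S = 8, and only the tag bits need storing alongside each line.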
42 Maintenance Fine, but incoherencies occur when: code modifies itself (& Harvard): the new instruction is still in the data cache while the instruction cache (and memory) have the old one! The OS switches context (for virtually addressed caches): virtual addresses are remapped to a different memory range. The cacheability of memory regions is changed (MPU or MMU). Several cores access the same global variable. Software must then initiate cache maintenance operations, either: invalidate (valid := 0) or clean-invalidate (write back to main memory & valid := 0); on the whole cache, by virtual address, or by physical address. Takes time! 42
43 43 Any question on caches?
44 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 44
45 Multicore CPU (thread level parallelism) Single-threaded execution is reaching physical limits: frequency -> power -> heat dissipation limits (PCs). Studies show multicore is more energy-efficient than single-core. Thread-level parallelism (TLP) as opposed to ILP. Industry is now moving to multi-CPU systems. Software must manage coherency among caches whenever a global variable is shared among several cores. Problems: software complexity (debug is a pain!) and performance loss (some 30% of time spent in cache maintenance). Multicore CPUs: hardware support for cache coherency; see the MOESI/MESI protocols 45
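A minimal sketch of the MESI states for a single cache line, seen from one core: local reads and writes, and snooped requests from other cores, drive the state. Bus transactions, write-backs and data movement are omitted, and the function names are mine.

```c
/* MESI state sketch for one line in one core's cache. Transitions only;
   the actual bus signalling and data transfer are not modelled. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

mesi_t local_read(mesi_t s, int others_have_copy) {
    if (s == INVALID)                    /* read miss: fetch the line */
        return others_have_copy ? SHARED : EXCLUSIVE;
    return s;                            /* read hit: state unchanged */
}

mesi_t local_write(mesi_t s) {
    (void)s;                             /* from INVALID this also implies */
    return MODIFIED;                     /* a fetch + invalidate (omitted) */
}

mesi_t snooped_read(mesi_t s) {          /* another core reads the line; */
    return s == INVALID ? INVALID : SHARED;  /* MODIFIED also writes back */
}

mesi_t snooped_write(mesi_t s) {         /* another core writes the line */
    (void)s;
    return INVALID;
}
```

In the Cortex-A9 MPCore, the Snoop Control Unit performs this snooping in hardware, including direct cache-to-cache transfers, which is what removes the 30%-style software maintenance cost mentioned above.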
46 Cache coherency (Cortex-A9 MPCore) Configurable: between 1 and 4 CPUs, each with an optional SIMD engine or floating-point unit (FPU/NEON) and trace. [Figure: up to four Cortex-A9 CPUs, each with its own instruction and data caches, connected to a Snoop Control Unit (SCU) providing cache-to-cache transfers and snoop filtering; also: Generalized Interrupt Control and Distribution (GIC), timers, Accelerator Coherence Port (ACP), advanced bus interface unit, and the L2 cache controller (PL310)] 46
47 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 47
48 Summary & Conclusion At the pipeline level, instruction throughput is improved by: good predictive models for branch speculation; dynamic scheduling (and renaming improves scheduling). Caches reduce latency but incur non-determinism. Multicore extensions support cache coherency for faster inter-process communication. Upcoming challenges in processor design: even more optimizations to hide latencies (e.g. speculative data prefetch); the memory system plays a critical role in overall performance (not just the CPU alone!); SoC design complexity is paramount (10s of IPs to integrate); power consumption is the key consideration! 48
49 Questions? Nintendo DSi (ARM7 & ARM9) Microsoft Zune HD (ARM11 MPCore) Lenovo Skylight (Qualcomm Snapdragon) Nokia N900 (Cortex-A8) 49
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationLecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )
Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections 2.3-2.6) 1 Correlating Predictors Basic branch prediction: maintain a 2-bit saturating counter for each
More informationCPE300: Digital System Architecture and Design
CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Pipelining 11142011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review I/O Chapter 5 Overview Pipelining Pipelining
More informationARM Processors for Embedded Applications
ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or
More informationEN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design
EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown
More informationIntel released new technology call P6P
P6 and IA-64 8086 released on 1978 Pentium release on 1993 8086 has upgrade by Pipeline, Super scalar, Clock frequency, Cache and so on But 8086 has limit, Hard to improve efficiency Intel released new
More informationSuperscalar Processors Ch 14
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More informationSuperscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More informationLecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2
Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time
More informationCommunications and Computer Engineering II: Lecturer : Tsuyoshi Isshiki