BOBCAT: AMD'S LOW-POWER X86 PROCESSOR


ARCHITECTURES FOR MULTIMEDIA SYSTEMS PROF. CRISTINA SILVANO LOW-POWER X86 20/06/2011

AMD Bobcat
- Small, efficient, low-power x86 core
- Excellent performance
- Synthesizable, with a smaller number of custom arrays
- Easily portable across process technologies

Feature Set:
- 64-bit AMD64 x86
- SIMD extensions: SSE1, SSE2, SSE3, SSE4A
- Virtualization (AMD-V)
- Support for misaligned 128-bit data types
- Instruction Based Sampling
- C6 (with power gating)

- Combination of CPU and GPU for high-performance compute capability
- High-speed bus architectures
- Shared low-latency memory model
- Single-die design

Micro-Architecture Overview
- Dual x86 instruction decoder
- Out-of-order (OoO) instruction execution
- Dual COP retirement
- Improved branch predictor
- Efficient OoO load/store engine and hazard prediction
- Advanced virtualization, ASIDs, and world-switch acceleration
- Low-power C6 state with core-level power gating and state save

Micro-Architecture
[Figure: core block diagram. Labeled blocks: ITLB, instruction cache, fetch queue, branch predictor (branch locator, return stack, condition predictor, dynamic target), Ucode, dual x86 decoder, reorder buffer, instruction queue, Int rename, Int PRF, FP decode, FP rename, FP sched, FP PRF, table walker, DTLB, data cache, load/store unit, prefetch, LAGU, SAGU, Alu, Mul, IntMul, MMX, St Conv, FP logical, FPAdd, FPMul, 512-Kbyte L2 cache, bus unit (to/from Northbridge).]

Micro-Architecture
Icache:
- 32 Kbyte, 2-way set associative
- 64-byte line
- Parity protected
- 512/8-entry ITLB (4k/2m)
- Fetches up to 32 bytes per cycle
Branch Predictor:
- Predicts up to two branches per cycle
- Remembers branch instruction locations
- Return stack address predictor
- Indirect dynamic address predictor
- State-of-the-art condition predictor
- Only necessary structures are clocked
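The Icache parameters above fix the number of sets and the address bit fields. A minimal sketch, assuming a conventional offset/index/tag split (the slides do not say how the cache is actually indexed or tagged); the function names are hypothetical:

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (num_sets, offset_bits, index_bits) for a set-associative cache.

    Assumes all three parameters are powers of two.
    """
    num_sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for a power of two
    index_bits = num_sets.bit_length() - 1
    return num_sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Icache figures from the slide: 32 Kbyte, 2-way, 64-byte lines.
ICACHE = cache_geometry(32 * 1024, 2, 64)
```

With the slide's figures this gives 256 sets, 6 offset bits, and 8 index bits.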

Micro-Architecture
Dual x86 Decoder:
- Scans up to 22 bytes.
- Decodes up to two x86 instructions per cycle.
- The decoder can directly map 89% of x86 instructions to a single micro-op and an additional 10% to a pair of micro-ops; more complicated x86 instructions (<1%) are microcoded. (Percentages are dynamic instruction counts.)
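The decoder mix above implies an average micro-op count per x86 instruction only slightly above one. A rough sketch of that arithmetic follows; the slide gives no micro-op count for microcoded instructions, so `uops_per_microcoded` is a made-up placeholder, and the "<1%" share is rounded up to 1%:

```python
def avg_uops_per_instruction(single=0.89, double=0.10, microcoded=0.01,
                             uops_per_microcoded=4):
    """Weighted average micro-ops per x86 instruction.

    single, double: fractions from the slide (89% and 10%).
    microcoded: upper bound for the "<1%" share.
    uops_per_microcoded: assumption, not a figure from the slides.
    """
    return single * 1 + double * 2 + microcoded * uops_per_microcoded


estimate = avg_uops_per_instruction()
```

Under these assumptions the estimate is about 1.13 micro-ops per instruction; the result is not very sensitive to the placeholder because microcoded instructions are so rare.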

Micro-Architecture
Integer Execution:
- A dual-port integer scheduler feeds two ALUs.
- A dual-port address scheduler feeds a load address unit and a store address unit.
- The physical register file uses maps and pointers to reduce power by minimizing data copying and movement.
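The "maps and pointers" idea can be illustrated with a generic rename map: values stay in one physical register file, and only small pointers change when a register is renamed. This is a textbook-style sketch, not Bobcat's actual rename logic; the class and its sizes are hypothetical:

```python
from collections import deque


class RenameMap:
    """Generic register renaming with a map table and a free list."""

    def __init__(self, num_arch, num_phys):
        # Architectural register i initially maps to physical register i.
        self.map = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))

    def rename(self, dst, srcs):
        """Rename one instruction; returns (new_phys, src_phys, old_phys)."""
        src_phys = [self.map[s] for s in srcs]  # read sources before updating dst
        new_phys = self.free.popleft()          # allocate a fresh physical register
        old_phys = self.map[dst]
        self.map[dst] = new_phys                # only the pointer changes
        return new_phys, src_phys, old_phys

    def retire(self, old_phys):
        """At retirement the previous mapping can be recycled."""
        self.free.append(old_phys)
```

No register value is copied at rename or retire; the previous physical register is simply returned to the free list once the instruction retires.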

Micro-Architecture
Floating Point Unit:
- A centralized FP scheduler feeds two 64-bit FP execution stacks.
- MMX and Logical Unit can perform two SP multiplies per cycle.
- The FP Mul unit can perform two SP multiplies per cycle.
- The FP Add unit can perform two SP additions per cycle.
- A physical register file is used to reduce power.
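The FP Mul and FP Add figures above give a simple peak single-precision throughput estimate. A minimal sketch, counting only those two units; the clock frequency is a hypothetical input, since the slides do not state one:

```python
SP_MULS_PER_CYCLE = 2  # FP Mul unit, from the slide
SP_ADDS_PER_CYCLE = 2  # FP Add unit, from the slide


def peak_sp_flops_per_cycle():
    """Peak single-precision operations per cycle from FPMul plus FPAdd."""
    return SP_MULS_PER_CYCLE + SP_ADDS_PER_CYCLE


def peak_sp_gflops(clock_ghz):
    """Peak GFLOPS per core at a given clock (clock_ghz is an assumption)."""
    return peak_sp_flops_per_cycle() * clock_ghz
```

For example, at an assumed 1.6 GHz this would be 6.4 SP GFLOPS per core, an upper bound that real code rarely reaches.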

Micro-Architecture
Data Cache:
- 8-way set associative
- 64-byte line
- Parity protected
- Copyback
- 40/8-entry L1 DTLB (4k/2m)
- 512/64-entry L2 DTLB (4k/2m)
- Advanced 8-stream prefetcher
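The TLB entry counts translate directly into "reach", the amount of memory that can be mapped without a table walk. The sketch below reads "40/8 (4k/2m)" as 40 entries for 4-Kbyte pages and 8 entries for 2-Mbyte pages, which is an interpretation of the slide's shorthand:

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when every entry is in use."""
    return entries * page_bytes


# Entry counts from the slide; page sizes per the "(4k/2m)" annotation.
L1_DTLB_REACH = {
    "4K": tlb_reach(40, 4 * KIB),   # 160 KiB
    "2M": tlb_reach(8, 2 * MIB),    # 16 MiB
}
L2_DTLB_REACH = {
    "4K": tlb_reach(512, 4 * KIB),  # 2 MiB
    "2M": tlb_reach(64, 2 * MIB),   # 128 MiB
}
```

With 4-Kbyte pages the L1 DTLB covers only 160 KiB, which is one reason a larger second-level DTLB matters.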

Micro-Architecture
Out-of-Order Load/Store Unit:
- Loads bypassing loads
- Loads bypassing stores
- Stores bypassing loads
- Bypass tracking and dependency correction
- Hazard predictor
- Fast store forwarding
- Fast critical-word fill forwarding
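Store forwarding, one of the features listed above, can be sketched with a toy store queue: a load first checks older, not-yet-written stores and takes the youngest matching value before falling back to memory. This is a generic illustration, not Bobcat's design; it assumes word-granular addresses with all store addresses already known, and it does not model the hazard predictor:

```python
class StoreQueue:
    """Toy store queue with store-to-load forwarding."""

    def __init__(self):
        self.entries = []  # (addr, value) pairs in program order, oldest first

    def store(self, addr, value):
        self.entries.append((addr, value))

    def load(self, addr, memory):
        """Return (value, forwarded). The youngest matching store wins."""
        for a, v in reversed(self.entries):
            if a == addr:
                return v, True
        return memory.get(addr, 0), False

    def drain_oldest(self, memory):
        """Write the oldest pending store to memory (here a plain dict)."""
        addr, value = self.entries.pop(0)
        memory[addr] = value
```

A load to an address with no pending store bypasses the queued stores entirely and reads memory, which corresponds to the "loads bypassing stores" case.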

Micro-Architecture
L2 Cache:
- 512 Kbyte
- 16-way set associative
- 64-byte lines
- ECC protected
- Half-speed clocking for power reduction
Bus Unit:
- 8 outstanding data accesses
- 2 outstanding fetch accesses
- Eviction buffers
- Fill buffers
- Write-combining buffers
- Coherency management

Pipeline
[Figure: pipeline diagram, cycles 0 to 12. Recoverable labels:
- Fetch: Fetch0 to Fetch5 (utag, TLB, Tag, Way, Data, Write IB)
- Branch prediction: Sparse BP, Dense BP, Ind BP, Target addr check; resteer points: Cond predict, Sparse taken resteer, Dense taken resteer, Check/Ind address resteer
- Decode: PreDec, LenDec, InstDec, Pack, FDec, MDec (Microcode ROM); Length decode; Decode illegal, Dbls, Fast/ROM; Lane fill, write IQ
- Dispatch and schedule: Token allocation; Cop dec, rename; Write SQs; Sched; RegRead; EXE; Writeback; Mispredict transit
- Address generation and data cache: AGU; Agen; Drive, write MOQ; DC1; DC2; TLB; Tag; Hit; Data muxing, drive
- FPU: Transit to FPU, FP decode, Stack rename, Reg rename, Write FP SQ, Schedule, RegRead
- Branch mispredict loop: 13 cycles
- 3-cycle load]
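The 13-cycle branch mispredict loop can be turned into a back-of-the-envelope CPI cost. In this sketch only the 13-cycle figure comes from the slide; the branch fraction and misprediction rate are illustrative assumptions:

```python
MISPREDICT_PENALTY_CYCLES = 13  # branch mispredict loop, from the slide


def mispredict_cpi_penalty(branch_fraction, mispredict_rate,
                           penalty=MISPREDICT_PENALTY_CYCLES):
    """Extra cycles per instruction lost to branch mispredictions.

    branch_fraction: share of instructions that are branches (assumed).
    mispredict_rate: share of branches mispredicted (assumed).
    """
    return branch_fraction * mispredict_rate * penalty


# Example with made-up workload numbers: 20% branches, 5% mispredicted.
example_penalty = mispredict_cpi_penalty(0.20, 0.05)
```

With these assumed inputs the cost is about 0.13 cycles per instruction, which shows why predictor accuracy matters even on a short pipeline.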

Physical Design
- Ontario/Zacate Accelerated Processing Unit
- Power gating on most die units
- Fusion architecture enables video transcode and image processing

Power Reduction
- Use of physical register files
- Non-shifting queues with pointers
- Clock gating
- Integrated core power gating
- Clocking arrays only when needed, for example predicting the type of branch and then clocking the appropriate predictor
- Elimination of instruction marker bits in the Icache
- Speed-path polishing in order to raise the Vt mix and reduce leakage
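A "non-shifting queue with pointers" can be pictured as a circular buffer: entries stay in the slot where they were written, and only the head and tail pointers move. This is a software analogy for the idea, not a description of Bobcat's hardware queues; the class name is hypothetical:

```python
class PointerQueue:
    """Fixed-size FIFO in which entries never move; only pointers advance."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0   # index of the oldest entry
        self.tail = 0   # index of the next free slot
        self.count = 0

    def push(self, item):
        if self.count == len(self.slots):
            raise OverflowError("queue full")
        self.slots[self.tail] = item
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("queue empty")
        item = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return item
```

In a shifting queue every dequeue would rewrite all remaining entries; here a dequeue touches one pointer, which in hardware means far fewer bits toggling per cycle.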

Overview
- Bobcat is the CPU engine for AMD's first APU
- Provides 90% of the performance of today's mainstream notebooks in half the area
- Highly portable across designs
- Sub-one-watt capable core

References
- Brad Burgess, Brad Cohen, Marvin Denman, Jim Dundas, David Kaplan, Jeff Rupley, "Bobcat: AMD's Low-Power x86 Processor," IEEE Micro, vol. 31, no. 2, pp. 16-25, Mar./Apr. 2011, doi: 10.1109/MM.2011.2
- Brad Burgess, "AMD's 'Bobcat' x86 Core - Small, Efficient and Strong," Hot Chips 22, August 22-24, 2010, Memorial Auditorium, Stanford University

THANK YOU