BOBCAT: AMD'S LOW-POWER X86 PROCESSOR

ARCHITECTURES FOR MULTIMEDIA SYSTEMS PROF. CRISTINA SILVANO LOW-POWER X86 20/06/2011

AMD Bobcat
- Small, efficient, low-power x86 core
- Excellent performance
- Synthesizable, with a smaller number of custom arrays
- Easily portable across process technologies

Feature Set
- 64-bit AMD64 x86
- SIMD extensions: SSE, SSE2, SSE3, SSE4A
- Virtualization (AMD-V)
- Support for misaligned 128-bit data types
- Instruction-Based Sampling
- C6 power state (with power gating)

- Combination of CPU and GPU for high-performance compute capability
- High-speed bus architecture
- Shared low-latency memory model
- Single-die design

Micro-Architecture Overview
- Dual x86 instruction decoder
- Out-of-order (OoO) instruction execution
- Dual COP retirement
- Improved branch predictor
- Efficient OoO load/store engine and hazard prediction
- Advanced virtualization: ASIDs and world-switch acceleration
- Low-power C6 state with core-level power gating and state save

Micro-Architecture
[Block diagram]
- Fetch: ITLB, instruction cache, fetch queue, branch predictor (branch locator, return stack, condition predictor, dynamic target)
- Decode: Ucode, dual x86 decoder, reorder buffer, instr queue
- Integer: Int rename, Int PRF, Mul, LAGU, SAGU
- Floating point: FP decode, FP rename, FP sched, FP PRF, MMX ALU (x2), IntMul, St Conv, FP logical (x2), FPAdd, FPMul
- Memory: table walker, DTLB, data cache, load/store unit, prefetch, 512-Kbyte L2 cache, bus unit (to/from Northbridge)

Micro-Architecture
Icache:
- 32 Kbyte, 2-way set associative
- 64-byte line
- Parity protected
- 512/8-entry ITLB (4k/2m)
- Fetch up to 32 bytes per cycle
Branch Predictor:
- Predicts up to two branches per cycle
- Remembers branch instruction locations
- Return stack address predictor
- Indirect dynamic address predictor
- State-of-the-art condition predictor
- Only necessary structures are clocked
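
The Icache parameters above fix how a fetch address is split into offset, index and tag. The following is a minimal sketch of that arithmetic, assuming a conventional power-of-two set-associative organization; the function name `cache_geometry` is illustrative and not taken from the source.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, associativity and line size are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the number of sets
    return sets, offset_bits, index_bits


# Icache figures from the slide: 32 Kbyte, 2-way, 64-byte line.
icache = cache_geometry(32 * 1024, ways=2, line_bytes=64)
print(icache)  # (256, 6, 8)
```

Under these assumptions the Icache has 256 sets, so a lookup uses 6 offset bits and 8 index bits, with the remaining address bits forming the tag.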

Micro-Architecture
Dual x86 Decoder:
- Scans up to 22 bytes
- Decodes up to two x86 instructions per cycle
- Directly maps 89% of x86 instructions to a single micro-op and a further 10% to a pair of micro-ops; more complicated x86 instructions (<1%) are microcoded (percentages are dynamic instruction counts)
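
The decode mix above implies an average micro-op count per x86 instruction only slightly above one. The sketch below works through that estimate. The slide does not say how long a microcoded sequence is, so `microcoded_len` is a hypothetical parameter, and the microcoded share is taken as 1%, the upper bound of the quoted "<1%".

```python
def avg_microops(microcoded_len, single=0.89, double=0.10, microcoded=0.01):
    """Estimate micro-ops per x86 instruction from the dynamic decode mix.

    single/double come from the slide; microcoded is the upper bound of
    the quoted "<1%"; microcoded_len is an assumed sequence length.
    """
    return single * 1 + double * 2 + microcoded * microcoded_len


# Assumption for illustration only: a microcoded instruction expands to 8 micro-ops.
print(f"{avg_microops(8):.2f}")  # 1.17
```

Even with a fairly long assumed microcode sequence, the average stays close to one micro-op per instruction, which is consistent with a two-wide decoder feeding a two-wide retire stage.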

Micro-Architecture
Integer Execution:
- A dual-port integer scheduler feeds two ALUs
- A dual-port address scheduler feeds a load address unit and a store address unit
- The physical register file uses maps and pointers to reduce power by minimizing data copying and movement
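
The idea of using maps and pointers instead of copying data can be illustrated with a toy rename map. This is a generic textbook-style sketch, not Bobcat's actual rename logic; the class `RenameMap`, its methods and the register names are all hypothetical.

```python
class RenameMap:
    """Toy register renamer: values stay put in the PRF, only pointers change."""

    def __init__(self, arch_regs, phys_regs):
        # Each architectural register initially points at its own physical slot.
        self.map = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), phys_regs))
        self.prf = [0] * phys_regs

    def rename_dest(self, arch_reg):
        """Point arch_reg at a fresh physical register; return (new, old)."""
        new = self.free.pop(0)
        old = self.map[arch_reg]
        self.map[arch_reg] = new
        return new, old

    def read(self, arch_reg):
        """Read the current value of arch_reg through the map."""
        return self.prf[self.map[arch_reg]]

    def retire(self, old_phys):
        """Recycle the previous mapping once the new value is committed."""
        self.free.append(old_phys)


rm = RenameMap(["rax", "rbx"], phys_regs=4)
new, old = rm.rename_dest("rax")  # allocate a new home for rax
rm.prf[new] = 42                  # the result is written exactly once
print(rm.read("rax"))             # 42
rm.retire(old)
```

In this model a result is written into the register file once and never moved; retirement only returns a pointer to the free list, which is the data-movement saving the slide refers to.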

Micro-Architecture
Floating Point Unit:
- A centralized FP scheduler feeds two 64-bit FP execution stacks
- Both stacks include an MMX ALU and an FP logical unit
- The FP Mul unit can perform two SP multiplies per cycle
- The FP Add unit can perform two SP additions per cycle
- A physical register file is used to reduce power
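
The per-cycle figures above give a simple peak single-precision throughput estimate. The sketch below assumes the multiply and add units can both be busy in the same cycle. The clock frequency is a free parameter, and the 1.6 GHz used in the example is an assumption for illustration, not a figure from the slides.

```python
SP_MULS_PER_CYCLE = 2  # from the slide: FP Mul unit
SP_ADDS_PER_CYCLE = 2  # from the slide: FP Add unit
STACK_WIDTH_BITS = 64  # from the slide: two 64-bit FP execution stacks


def peak_sp_gflops(clock_ghz):
    """Peak single-precision GFLOP/s if both units issue every cycle."""
    return (SP_MULS_PER_CYCLE + SP_ADDS_PER_CYCLE) * clock_ghz


def passes_for_vector(vector_bits):
    """Passes a vector operation needs through one 64-bit stack."""
    return -(-vector_bits // STACK_WIDTH_BITS)  # ceiling division


print(peak_sp_gflops(1.6))     # assumed 1.6 GHz clock
print(passes_for_vector(128))  # 2
```

A 128-bit SSE operation therefore occupies a 64-bit stack for two passes, which is how the design trades peak vector throughput for area and power.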

Micro-Architecture
Data Cache:
- 8-way set associative
- 64-byte line
- Parity protected
- Copyback
- 40/8-entry L1 DTLB (4k/2m)
- 512/64-entry L2 DTLB (4k/2m)
- Advanced 8-stream prefetcher
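
The TLB sizes above translate into an amount of memory that can be touched without a table walk (the TLB reach). The sketch below assumes that "(4k/2m)" gives the entry counts for 4 Kbyte and 2 Mbyte pages respectively; the helper `tlb_reach` is illustrative.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when every entry maps a distinct page."""
    return entries * page_bytes


reach = {
    "L1 DTLB, 4k pages": tlb_reach(40, 4 * KB),
    "L1 DTLB, 2m pages": tlb_reach(8, 2 * MB),
    "L2 DTLB, 4k pages": tlb_reach(512, 4 * KB),
    "L2 DTLB, 2m pages": tlb_reach(64, 2 * MB),
}
for name, size in reach.items():
    print(f"{name}: {size // KB} Kbyte")
```

Under that reading, the L1 DTLB covers 160 Kbyte with 4 Kbyte pages and the L2 DTLB covers 2 Mbyte, so the second-level TLB is what keeps table walks rare for working sets that spill out of the 512-Kbyte L2 cache.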

Micro-Architecture
Out-of-Order Load/Store Unit:
- Loads bypassing loads
- Loads bypassing stores
- Stores bypassing loads
- Bypass tracking and dependency correction
- Hazard predictor
- Fast store forwarding
- Fast critical-word fill forwarding
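
Store forwarding, listed above, can be sketched with a simple store queue. This toy model assumes every older store address is already known and that accesses match exactly in address and size. Real hardware may issue a load before older store addresses resolve, which is the case the hazard predictor and dependency correction exist for. The function and variable names are hypothetical.

```python
def load(addr, store_queue, memory):
    """Return the value a load observes.

    store_queue holds older, not-yet-committed stores as (addr, value)
    pairs in program order; memory is a dict of committed values.
    """
    # Search youngest-first so the most recent matching store wins.
    for st_addr, st_value in reversed(store_queue):
        if st_addr == addr:
            return st_value          # forwarded from the store queue
    return memory.get(addr, 0)       # no older store matches: read memory


memory = {0x100: 1}
store_queue = [(0x100, 7), (0x200, 9), (0x100, 8)]
print(load(0x100, store_queue, memory))  # 8
print(load(0x300, store_queue, memory))  # 0
```

The load to 0x100 receives the value of the youngest matching store without waiting for it to reach the data cache, while the load to 0x300 bypasses all older stores because none of them conflicts.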

Micro-Architecture
L2 Cache:
- 512 Kbyte
- 16-way set associative
- 64-byte lines
- ECC protected
- Half-speed clocking for power reduction
Bus Unit:
- 8 outstanding data accesses
- 2 outstanding fetch accesses
- Eviction buffers
- Fill buffers
- Write-combining buffers
- Coherency management

Pipeline
[Pipeline diagram, cycles 0 to 12]
- Redirect paths: cond predict, sparse taken resteer, dense taken resteer, check/indirect address resteer
- Branch mispredict loop: 13 cycles
- Branch prediction: Sparse BP, Dense BP, Ind BP, target address check
- Fetch: Fetch0 to Fetch5 (utag, TLB, Tag, Way, Data, Write IB)
- Decode: PreDec, LenDec, InstDec, Pack, FDec (length decode; decode illegal, Dbls, Fast/ROM; lane fill, write IQ), with Microcode ROM and MDec
- Integer: Dispatch, Sched, RegRead, EXE, Writeback (token allocation; COP decode, rename; write SQs; mispredict transit)
- Load/store: AGU (Agen), drive and write MOQ, DC1, DC2 (TLB, Tag, Hit, data muxing, drive); 3-cycle load
- Floating point: transit to FPU, FP decode, stack rename, reg rename, write FP SQ, schedule, RegRead
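
The 13-cycle branch mispredict loop sets the cost of each wrong prediction. The sketch below shows how that cost feeds into cycles per instruction under a simple additive model. The base CPI, branch fraction and misprediction rate are assumed values for illustration; none of them comes from the slides.

```python
MISPREDICT_LOOP_CYCLES = 13  # from the pipeline slide


def cpi_with_mispredicts(base_cpi, branch_fraction, mispredict_rate):
    """Additive CPI model: each mispredicted branch costs the full loop."""
    penalty = branch_fraction * mispredict_rate * MISPREDICT_LOOP_CYCLES
    return base_cpi + penalty


# Assumed workload: 20% branches, 5% of them mispredicted, base CPI of 1.0.
print(f"{cpi_with_mispredicts(1.0, 0.20, 0.05):.2f}")  # 1.13
```

With these assumed numbers, mispredictions add about 0.13 cycles per instruction, which gives a sense of why predictor accuracy matters even on a short pipeline.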

Physical Design
- Ontario/Zacate Accelerated Processing Unit
- Power gating on most die units
- Fusion architecture enables video transcode and image processing

Power Reduction
- Use of physical register files
- Non-shifting queues with pointers
- Clock gating
- Integrated core power gating
- Clocking arrays only when needed: predicting the type of branch, then clocking the appropriate predictor
- Elimination of instruction marker bits in the Icache
- Speed-path polishing to raise the Vt mix and reduce leakage
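
A non-shifting queue with pointers, as listed above, can be sketched as a circular buffer: entries are written once and stay in place, and only the head and tail pointers move. This is a generic software analogy rather than the hardware design; the class name `PointerQueue` is hypothetical.

```python
class PointerQueue:
    """FIFO in which entries never move; only head/tail pointers advance."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0   # next entry to dequeue
        self.tail = 0   # next free slot
        self.count = 0

    def push(self, item):
        if self.count == len(self.slots):
            raise OverflowError("queue full")
        self.slots[self.tail] = item                   # single write
        self.tail = (self.tail + 1) % len(self.slots)  # pointer wraps around
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("queue empty")
        item = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return item


q = PointerQueue(2)
q.push("a")
q.push("b")
first = q.pop()   # "a"
q.push("c")       # reuses slot 0; "b" stays where it was written
print(first, q.slots)
```

A shifting queue would rewrite every remaining entry on each dequeue; here each entry is written once, so fewer storage cells toggle per operation.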

Overview
- Bobcat is the CPU engine for AMD's first APU
- Provides 90% of the performance of today's mainstream notebooks in half the area
- Highly portable across designs
- Sub-one-watt capable core

References
- Brad Burgess, Brad Cohen, Marvin Denman, Jim Dundas, David Kaplan, Jeff Rupley, "Bobcat: AMD's Low-Power x86 Processor," IEEE Micro, vol. 31, no. 2, pp. 16-25, Mar./Apr. 2011, doi: 10.1109/MM.2011.2
- Brad Burgess, "AMD's 'Bobcat' x86 Core: Small, Efficient and Strong," Hot Chips 22, August 22-24, 2010, Memorial Auditorium, Stanford University

THANK YOU