Intel Core Microarchitecture


Marco Morosini 651191, Matteo Larocca 680089
AY 2005/2006, Multimedia System Architectures

Presentation Outline
- New solutions for old problems
- Architecture overview
- Architecture details
- New features
- Conclusions

New solutions for old problems

P4: scalability is achieved by increasing clock speed.
Core: scalability is achieved through a multicore approach.
A CPU that runs 20% slower saves up to 70% of the power, so two cores at reduced speed are better than one core at full speed.

[Figure: power density in W/cm² for the Pentium 133, the Pentium 4 1.4 (Willamette) and the Pentium 4 561 (Prescott), with the Sun as a reference point]

NetBurst cons

High clock speed was achieved through NetBurst technology: a long pipeline (the Prescott P4 has 31 stages) leaves little work, and therefore low latency, for each stage, which allows a high clock speed. It is getting harder and harder to increase clock speed; NetBurst ends here.
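The power argument on this slide can be checked with quick arithmetic. The sketch below uses only the two figures quoted above (20% slower, up to 70% less power) and assumes, optimistically, that throughput scales linearly with clock speed and with the number of cores; the function name and parameters are illustrative.

```python
def dual_core_tradeoff(slowdown=0.20, power_saving=0.70, cores=2):
    """Compare `cores` slowed-down cores against one full-speed core.

    Returns (relative throughput, relative power, relative perf-per-watt).
    Assumes ideal scaling with clock speed and core count, which real
    workloads rarely reach.
    """
    per_core_speed = 1.0 - slowdown        # 0.8 of the full-speed core
    per_core_power = 1.0 - power_saving    # 0.3 of the full-speed core
    throughput = cores * per_core_speed
    power = cores * per_core_power
    return throughput, power, throughput / power


throughput, power, efficiency = dual_core_tradeoff()
print(f"throughput x{throughput:.2f}, power x{power:.2f}, perf/W x{efficiency:.2f}")
```

Under these assumptions the two slower cores deliver more total throughput than the single fast core while drawing less power, which is the point the slide is making.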


Core Microarchitecture

Core microarchitecture has been designed from scratch to overcome two problems:
- Excessive power consumption
- Difficulties in improving performance

Microarchitecture 1/5

Core's designers took everything that has already been proven to work and added more of it.

[Figure: pipeline block diagram. Violet: fetch stage. Orange: decode stage. Yellow: reorder stage. Blue: execution stage. Green: memory access stage.]


Microarchitecture 2/5

Front end: fetch + decode
- 3 simple decoders (1 more than in P6) each translate 1 x86 instruction into 1 micro-op per clock cycle.
- 1 complex decoder handles the x86 instructions that translate into 2-4 micro-ops. It can output up to 4 micro-ops per clock cycle.
- More instructions can now use the simple decoders (e.g. SSE, memory access, and so on).
- The micro-ops buffer can receive up to 7 micro-ops per cycle.

Microarchitecture 3/5

Core represents the current apex of out-of-order execution (OOOE) design, where as much code and data stream optimization as possible is carried out in silicon.
- The ROB now has 96 entries (up from 40).
- The goal of the faster front end and of Core's new features is to keep the OOOE hardware constantly fed, so that the instruction window and execution core stay full of code.
- The PRF renames registers in order to avoid write-after-read (WAR) hazards.
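The front-end figures from Microarchitecture 2/5 add up as follows: three simple decoders producing one micro-op each plus one complex decoder producing up to four gives the 7 micro-ops per cycle that the buffer can receive. The toy model below is a sketch under simplifying assumptions (instructions are placed greedily in program order, and the complex decoder may also take a single-micro-op instruction); it is not a description of the real decoder steering logic.

```python
SIMPLE_DECODERS = 3      # each emits 1 micro-op per cycle
COMPLEX_DECODERS = 1     # emits up to 4 micro-ops per cycle
MAX_UOPS_COMPLEX = 4


def peak_uops_per_cycle():
    """Upper bound on micro-ops delivered to the buffer in one cycle."""
    return SIMPLE_DECODERS * 1 + COMPLEX_DECODERS * MAX_UOPS_COMPLEX


def decode_one_cycle(uop_counts):
    """Greedy toy model of one decode cycle.

    `uop_counts` lists, in program order, how many micro-ops each pending
    x86 instruction translates into. Returns (instructions decoded,
    micro-ops produced). Decoding stops at the first instruction that
    cannot be placed on a free decoder.
    """
    simple_free = SIMPLE_DECODERS
    complex_free = COMPLEX_DECODERS
    decoded = uops = 0
    for n in uop_counts:
        if n == 1 and simple_free:
            simple_free -= 1
        elif 1 <= n <= MAX_UOPS_COMPLEX and complex_free:
            complex_free -= 1
        else:
            break
        decoded += 1
        uops += n
    return decoded, uops


print(peak_uops_per_cycle())
print(decode_one_cycle([4, 1, 1, 1, 1]))
print(decode_one_cycle([2, 3]))
```

The last call shows why it matters that more instructions can now use the simple decoders: two multi-micro-op instructions in a row compete for the single complex decoder, so only one of them decodes in that cycle.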


Microarchitecture 4/5

Enlarged reservation station: 3 of its 6 issue ports serve the execution units.
- 3 64-bit integer execution units: 1 complex integer unit (CIU) and 2 simple integer units (SIUs). Because the 64-bit integer ALUs are on separate issue ports, Core can sustain a total throughput of three 64-bit integer operations per cycle.
- 2 floating-point execution units.
- 128-bit SSE execution units.

Microarchitecture 5/5

The other 3 issue ports are dedicated to memory system operations.
- Thanks to the presence of the MOB, store-data and store-address operations can be performed in parallel.
- The L2 cache (4 MB, 16-way) is shared between the cores.
- Each core has its own L1 cache, and the L1 caches can exchange data directly.

Macrofusion 1/2

Macrofusion is the ability to fuse certain pairs of x86 instructions together in the predecode phase and send them through a single decoder to be translated into a single micro-op. Not all x86 instructions can be macrofused: typically, compare and test instructions can be macrofused with the branch instruction that follows them.

Macrofusion 2/2

Up to 1 macrofusion can be performed per cycle, by any one of the four decoders.

Benefits:
- It simplifies the work of the ROB and the RS, since there are fewer micro-ops in flight for the core to track. This also saves power.
- It increases Core's execution width, because a single ALU can execute what is essentially two x86 instructions simultaneously.
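As a rough illustration of the rule above, the sketch below scans one decode group and fuses at most one compare-or-test instruction with the conditional branch that immediately follows it. The mnemonic lists and the helper names are assumptions made for the example; the real hardware applies further restrictions that the slides do not cover.

```python
FUSIBLE_FIRST = {"cmp", "test"}   # compare and test instructions


def is_conditional_branch(mnemonic):
    """Treat any j* mnemonic except the unconditional jmp as a conditional branch."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def macro_fuse(instructions, max_fusions=1):
    """Group one cycle's instructions into decoder inputs.

    Returns a list of tuples: a 2-tuple is a fused pair that will become a
    single micro-op, a 1-tuple is an ordinary instruction. At most
    `max_fusions` pairs are fused, matching the one-per-cycle limit.
    """
    out = []
    fusions = 0
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if (fusions < max_fusions and cur in FUSIBLE_FIRST
                and nxt is not None and is_conditional_branch(nxt)):
            out.append((cur, nxt))
            fusions += 1
            i += 2
        else:
            out.append((cur,))
            i += 1
    return out


group = ["mov", "cmp", "jne", "test", "jz"]
print(macro_fuse(group))
```

In this example five x86 instructions become four decoder inputs: the first compare-and-branch pair is fused, while the second pair is left unfused because the per-cycle limit has already been reached.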

Microfusion 1/2

Micro-ops fusion has some effects similar to those of macrofusion, but it works differently. A simple/fast decoder takes in a single x86 instruction that would normally translate into two micro-ops and produces a fused pair of micro-ops that the ROB tracks using a single entry. When they reach the reservation station, the two members of the fused pair are allowed to issue separately, either in parallel through two different issue ports or serially through the same port, depending on the situation.

Microfusion 2/2

Example: store instruction

Store instructions are broken down into two uops:
- store-address uop: calculates the address where the data is to be stored
- store-data uop: writes the data to be stored into the outgoing store data buffer

Because the two operations are inherently parallel and are performed by two separate execution units on two separate issue ports, these two uops can be tracked by the ROB as one and executed in parallel.
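The bookkeeping effect of micro-ops fusion can be shown by counting ROB entries for a short instruction window. In this sketch only the store breakdown comes from the slide; the other table entries and the function names are hypothetical.

```python
# Micro-op breakdown per instruction. Only the "store" split is taken from
# the slide; "load" and "add" are illustrative single-micro-op instructions.
UOPS = {
    "store": ("store-address", "store-data"),
    "load": ("load",),
    "add": ("add",),
}


def rob_entries(instructions, micro_fusion):
    """Count ROB entries needed to track the given instructions.

    With micro-fusion, an instruction that decodes into two micro-ops is
    tracked as a fused pair in a single entry.
    """
    entries = 0
    for name in instructions:
        uops = UOPS[name]
        if micro_fusion and len(uops) == 2:
            entries += 1
        else:
            entries += len(uops)
    return entries


def issued_uops(instructions):
    """Micro-ops that still issue separately from the reservation station."""
    return sum(len(UOPS[name]) for name in instructions)


window = ["store", "add", "store", "load"]
print("ROB entries without fusion:", rob_entries(window, micro_fusion=False))
print("ROB entries with fusion:   ", rob_entries(window, micro_fusion=True))
print("micro-ops issued:          ", issued_uops(window))
```

The number of micro-ops that execute is unchanged; what shrinks is the number of entries the ROB has to track.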

Digital media boost 1/2

Before Core, datapaths were 64 bits wide, so 128-bit SIMD instructions had to be broken down into two micro-ops.

Digital media boost 2/2

With 128-bit datapaths, Core can handle a SIMD instruction with only one micro-op. The new design therefore not only eliminates the latency disadvantage, but also improves decode, dispatch, and scheduling bandwidth, because half as many micro-ops are generated for 128-bit vector instructions.
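The halving of micro-ops follows directly from the ratio of vector width to datapath width. A minimal sketch, assuming one micro-op per datapath-wide slice of the vector (the function name is illustrative):

```python
def simd_uops(vector_bits, datapath_bits):
    """Micro-ops needed for one SIMD instruction, one per datapath-wide slice."""
    return -(-vector_bits // datapath_bits)   # ceiling division


def total_uops(instruction_count, vector_bits, datapath_bits):
    """Micro-ops generated for a run of identical-width SIMD instructions."""
    return instruction_count * simd_uops(vector_bits, datapath_bits)


print("64-bit datapath: ", simd_uops(128, 64), "micro-ops per 128-bit instruction")
print("128-bit datapath:", simd_uops(128, 128), "micro-op per 128-bit instruction")
print("100 instructions:", total_uops(100, 128, 64), "vs", total_uops(100, 128, 128))
```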

Power saving

The Core microarchitecture includes an on-die digital thermal sensor for safety and reliability features; it is also rumored that it may be used to increase the frequency of the MPU when the sensor determines that there is thermal headroom. Each of the two cores is managed independently, and many entire blocks, such as the microcode sequencer, can be put to sleep. Most internal buses are gated for power savings, so if a bus is not sending out a full data load each cycle, part of the bus can be put to sleep. Normally, increasing the IPC of a design means higher power consumption, whether the extra resources are used or not. With the extensive power and clock gating in the Core MPU, the designers only pay for what is used, which makes a high-IPC design much more attractive.

Memory Disambiguation 1/2

Out-of-order processors must first put instructions back in program order before officially writing their results out, because a memory location cannot be modified until all of the previous instructions that read that location have completed execution. Intel architectures before Core are built around a conservative rule: a load is not allowed to execute before earlier store instructions whose addresses are still undefined.
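The conservative rule, and the speculative alternative that the next slide attributes to Core, can be contrasted with a small model. This is a simplified sketch: the `Store` record, the function names and the "rollback" result are illustrative, and the real hardware's decision of when to speculate is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Store:
    actual_address: int    # where the store really writes
    address_known: bool    # already computed when the load wants to issue?


def conservative_can_issue(earlier_stores):
    """Conservative rule: the load waits while any earlier store address is unknown."""
    return all(s.address_known for s in earlier_stores)


def speculative_load(load_address, earlier_stores):
    """Issue the load immediately and check later.

    Returns "rollback" if an earlier store whose address was unknown at issue
    time turns out to write the loaded location, otherwise "ok". Stores with
    known addresses are assumed to be handled correctly at issue time.
    """
    for s in earlier_stores:
        if not s.address_known and s.actual_address == load_address:
            return "rollback"
    return "ok"


stores = [Store(0x100, address_known=True), Store(0x200, address_known=False)]
print(conservative_can_issue(stores))     # the load would have to wait
print(speculative_load(0x300, stores))    # unrelated location
print(speculative_load(0x200, stores))    # collides with the unresolved store
```

When most loads touch locations unrelated to the pending stores, the speculative policy lets them issue early and pays the rollback cost only in the rare colliding case.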

Memory Disambiguation 2/2

Academic research shows that over 97% of the memory accesses in a processor's instruction window are to unrelated locations. Core therefore tries to execute a load ahead of earlier stores whose addresses are not yet known; if the speculation turns out to be wrong, a roll-back is performed.

Conclusions

Core looks like it has what it takes to carry Intel forward for at least another five years. It will help the software industry gradually make the transition to multithreaded code. It is an ambitious design that improves on its predecessors in nearly every way, especially in its fanatical focus on power efficiency (Intel promises 40% more performance for Conroe at 40% less power compared to the Pentium D).

