Poulson: An 8 Core 32 nm Next Generation I ntel I tanium. Processor. Steve Undy I ntel I NTEL CONFI DENTI AL

Size: px

Start display at page:

Download "Poulson: An 8 Core 32 nm Next Generation I ntel I tanium. Processor. Steve Undy I ntel I NTEL CONFI DENTI AL"

Tiffany Walton
6 years ago
Views:

1 Poulson: An 8 Core 32 nm Next Generation I ntel I tanium Processor Steve Undy I ntel 1 I NTEL CONFI DENTI AL

2 This slide MUST be used with any slides rem oved from this presentation Legal Disclaim er I NFORMATI ON I N THI S DOCUMENT I S PROVI DED I N CONNECTI ON WI TH I NTEL PRODUCTS. NO LI CENSE, EXPRESS OR I MPLI ED, BY ESTOPPEL OR OTHERWI SE, TO ANY I NTELLECTUAL PROPERTY RI GHTS I S GRANTED BY THI S DOCUMENT. EXCEPT AS PROVI DED I N I NTEL S TERMS AND CONDI TI ONS OF SALE FOR SUCH PRODUCTS, I NTEL ASSUMES NO LI ABI LI TY WHATSOEVER, AND I NTEL DI SCLAI MS ANY EXPRESS OR I MPLI ED WARRANTY, RELATI NG TO SALE AND/ OR USE OF I NTEL PRODUCTS I NCLUDI NG LI ABI LI TY OR WARRANTI ES RELATI NG TO FI TNESS FOR A PARTI CULAR PURPOSE, MERCHANTABI LI TY, OR I NFRI NGEMENT OF ANY PATENT, COPYRI GHT OR OTHER I NTELLECTUAL PROPERTY RI GHT. I NTEL PRODUCTS ARE NOT I NTENDED FOR USE I N MEDI CAL, LI FE SAVI NG, OR LI FE SUSTAI NI NG APPLI CATI ONS. I ntel m ay m ake changes to specifications and product descriptions at any tim e, without notice. Software and workloads used in perform ance tests m ay have been optim ized for perform ance only on I ntel m icroprocessors. Perform ance tests, such as SYSm ark and MobileMark, are m easured using specific com puter system s, com ponents, software, operations and functions. Any change to any of those factors m ay cause the results to vary. You should consult other inform ation and perform ance tests to assist you in fully evaluating your contem plated purchases, including the perform ance of that product when com bined with other products. Configurations: [ describe config + what test used + who did testing]. For m ore inform ation go to http: / / / perform ance I ntel does not control or audit the design or im plem entation of third party benchm arks or Web sites referenced in this docum ent. I ntel encourages all of its custom ers to visit the referenced Web sites or others where sim ilar perform ance benchm arks are reported and confirm whether the referenced benchm arks are accurate and reflect perform ance of system s available for purchase. I ntel processor num bers are not a m easure of perform ance. Processor num bers differentiate features within each processor fam ily, not across different processor fam ilies. See / products/ processor_num ber for details. I ntel, processors, chipsets, and desktop boards m ay contain design defects or errors known as errata, which m ay cause the product to deviate from published specifications. Current characterized errata are available on request. The code nam es Poulson and Tukwila presented in this docum ent are only for use by I ntel to identify products, technologies, or services in developm ent, that have not been m ade com m ercially available to the public, i.e., announced, launched or shipped. They are not "com m ercial" nam es for products or services and are not intended to function as tradem arks. I nt el, I ntel I tanium, I tanium, I ntel Xeon, I ntel Core m icroarchitecture, and the I ntel logo are tradem arks or registered tradem arks of I ntel Corporation or its subsidiaries in the United States and other countries. Copyright 2011 I nt el Corporat ion. All right s reserved. 2

3 Agenda Design Space Overview Parallelism Data I ntegrity Power Conclusion 3

4 Design Space: Mission Critical Perform ance Both single-thread speed and total throughput are im portant Goal: > 2x generat ional perform ance im provem ent Scalabilit y Glueless up to 8 sockets Support larger topologies via node controllers from system vendors Reliabilit y Redundancy and self- correct ion Error detection and recovery via hardware and firm ware Com patibility Socket com patible with I tanium 9300 Processor Series (Tukwila) Com m on platform ingredients with Xeon E7 Processor Series 4

5 Poulson Overview Tukw ila Poulson Process 65nm 32nm Devices 2.05 Billion 3.1 Billion Area 700 m m m m 2 Power ( m ax TDP) 185W 170W I tanium Cores 4 8 Last Level Cache Size 24MB 32MB I ntel QuickPath I nterconnect Links 4 full + 2 half 4 full + 2 half I ntel QPI Link Speed 4.8GT/ s 6.4GT/ s I ntel Scalable Mem ory I nterface 4 4 Links I ntel SMI Link Speed 4.8GT/ s 6.4GT/ s 5

Poulson Overview QPI Core 0 Core 1 QPI Shared Last Level Cache QPI QPI Core 6 Core 7 8 I tanium Cores 32MB Last Level Cache Directory Cache Core 2 ½ QPI Syst em Logic Shared Last Level

6 Poulson Overview QPI Core 0 Core 1 QPI Shared Last Level Cache QPI QPI Core 6 Core 7 8 I tanium Cores 32MB Last Level Cache Directory Cache Core 2 ½ QPI Syst em Logic Shared Last Level Cache Directory Cache Core 5 Core 3 Core 4 SMI SMI ½ QPI 4 full-width and 2 half-width I ntel QPI links 2 Direct ory caches 2 I ntegrated m em ory controllers with 2 I ntel SMI links each 6

7 Exam ple 8 - Socket Topology Full I ntel QPI Mem ory I / O I / O Hub Mem ory Mem ory Hub Mem ory Half I ntel QPI I ntel SMI Poulson Poulson Poulson Poulson Poulson Poulson Poulson Poulson Mem ory I / O Mem ory Mem ory I / O Mem ory Hub Hub 7

8 Poulson Core Tukw ila Poulson 8 Devices 108 m illion 89 m illion Area 69 m m 2 20 m m 2 Process scaled TDP Process scaled frequency 1.0 > 1.2 Threads I nstruction Q size 48 96x2 Max inst ruct ion issue/ cycle 6 12 Pipeline stages Pipeline hazard resolution I nterlock+ Stall Replay I nteger RF size I nteger RF ports 12R 8W 12R 12W RF protection Parity SECDED All- new I tanium core

9 I tanium New I nstructions I nteger operations m py4 m pyshl4 clz Thread Cont rol hint@priority Expanded Data Access Hints m ov dahr Expanded Software Prefetch lfetch.count 9

10 Exploiting Parallelism on All Levels I nst ruct ion Level Parallelism Mem ory Parallelism Thread Parallelism Multi- core Parallelism 10 Multi- level parallelism exposed to softw are control

I nstruction Parallelism Front- end Pipeline Flush Main Pipeline Replay Replay I PG FET FDC REN DEC REG EXE DET WRB WB2 I PG: Address Gen FET: Array Access FDC: Early Decode REN:

Dispersal DEC: I nstr Decode REG: Register Read EXE: Execut e DET: Except ion Det ect WRB: Writ e- back WB2: Com m it Q holds 96 instructions I n-order issue I ndependent thread dom

11 I nstruction Parallelism Front- end Pipeline Flush Main Pipeline Replay Replay I PG FET FDC REN DEC REG EXE DET WRB WB2 I PG: Address Gen FET: Array Access FDC: Early Decode REN: Regist er Renam e Fetch width: 32B/ cycle Mult i- level branch Predict ion Renam ing: 128 logical to 185 physical registers I ndependent thread dom ain I BD/ Q I BD: I nstr Buffer/ Dispersal DEC: I nstr Decode REG: Register Read EXE: Execut e DET: Except ion Det ect WRB: Writ e- back WB2: Com m it Q holds 96 instructions I n-order issue I ndependent thread dom ain OZQ MLD Pipeline Concurrency via 3 decoupled pipelines MLD: Mid-level data cache OZQ: Architectural m em ory ordering queue OZQ holds 16 m em ory ops OOO issue I ndependent thread dom ain 11

12 I nstruction Parallelism Front- end Back- end MLD I PG FET FDC REN 32B 6 I nstructions MQ I Q Mem ory I nteger OZQ Com pilers precisely control instruction dispersal AQ FQ BQ NopQ ALU Float ing Point Branch NOP w ide issue under softw are control 12

13 Mem ory Parallelism Avoiding Hazards Expanded Soft ware Hint s Provides control over allocation, speculation and prefetch policies Multi-line software prefetch Reduced Speculat ion Cost s Spont aneous Deferrals on speculat ive loads Deferred load can transform into a prefetch to cache and/ or TLB New Hardware Prefet cher I nto FLD and/ or MLD Adaptive based on FLD/ MLD m iss patterns Reduced Dat a Hazards Expanded store to consum er bypassing in FLD and MLD Single-cycle MLD to FLD line transfers Pipeline bottlenecks rem oved 13

14 Mem ory Parallelism I ncreasing Throughput More pending m em ory operations in each core I ncreased from 48 to 64 operations in MLD I ncreased from 44 to 80 operations in Ring interface Ring Cache 8 I ndependent LLC pipelines and controllers per socket 2 caching agents per socket to generate I ntel QPI transactions I ncreased I ntel QPI link bandwidth for m ulti- socket throughput 14 Mem ory Subsystem I m provem ents Hom e agent in-flight transaction tracker increased by 50% MC scheduler algorithm changes focused on perform ance and power efficiency I ntel SMI link speed increased I m proved queuing and B/ W

15 Thread Parallelism I ntel Hyper-Threading Technology with Dual Dom ain Multithreading support Front-end and m ain pipelines are independently threaded Pipeline-specific thread switch m echanism s More per thread resources I nstruction queues Data TLBs Hardware page walker handles concurrent walks to both threads I nstructions provide software hints to thread switch hardware Attention to Thread Fairness Hardware t o ident ify unfair behavior Measures instruction retirem ent and data cache resource usage Firm ware configurable responses to unfairness By biasing towards victim thread I ncreased throughput via im proved threading 15

10 port Rout er 128GB/ s QPI x6 I ntel QPI HAx2 MCx2

16 Multi- core and Multi- socket Parallelism 700GB/ s (aggregate) Core 3 LLC LLC Core 4 Core 2 LLC LLC Core 5 10 port Rout er 128GB/ s QPI x6 I ntel QPI HAx2 MCx2 45GB/ s I ntel SMI Core 1 LLC LLC Core 6 Core 0 LLC LLC Core 7 16

17 Reliability, Fault Tolerance and Recoverability I ntel I nstruction Replay Technology Main pipeline redesigned for error handling Enabled hardware and firm ware recovery of correct able errors I ncreased Error Detection and Correction Arrays MLI, MLD, LLC tag and Directory cache SECDED LLC data DECTED Register files (GR and FR) SECDED Logic path error detection in FP ALU via residues Key paths are protected end to end with invalidation and replay Reworked Error Det ect ion, Response and Report ing Fram ework Large increase in error detection and response capabilities I ncreased hardware responses and recovery I ntel Cache Safe Technology: lockout of failing cache locations 17

18 Pow er Aw are Design Replay- based Pipeline Enabled fully gated clocks Elim inated slow and power-wasting global pipeline stalls Aggressive Clock Gat ing Substantial reductions in both idle, typical and m ax power Rem oved Dynam ic Logic from FPU Low- Leakage FETs 18 Digit al Power Predict ion Measures, weights and averages key core state signals Beyond architectural state uses data patterns Estim ation error decreased from 35% to 2% 6 0 % TDP, 7 0 % idle C dyn reduction over process scaled Tukw ila

19 Poulson Status Well into post-si validation Booted and being tested on m ultiple operating system s Running in m any topologies On track for 2012 shipm ents 19

20 Conclusion All new core design 12- wide issue I ntel Hyper-Threading Technology with Dual Dom ain Multithreading I tanium New I nstructions and policies for fine-grain software control of hardware parallelism I ntel I nstruction Replay Technology for frequency, power and fault tolerance Core power efficiency: > 3x better than Tukwila 1 Ring- based design Large last level caches High bandwidth system interface Outstanding data reliability Enhanced error detection I m proved error recovery Better RAS integration 1 I nternal lab m easurem ent 20

22 Glossary DECTED DTB FI T FLB Double Error Correction, Triple Error Detection Second Level Data TLB Failure in Thousands First Level Branch Cache MLD MLI OZQ Mid Level Data Cache Mid Level I nstruction Cache Ordering czar Queue: architectural m em ory ordering QPI I ntel QuickPath I nterface FLD FLI HA I LP First Level Dat a Cache First Level I nstruction Cache Hom e Agent: I ntel QPI agent responsible for m anaging m em ory I nst ruct ion Level Parallelism I PF I tanium Processor Fam ily LLC MC Last Level Cache Mem ory Cont roller RAS RF SECDED Reliability, Availability and Serviceability Register File Single Error Correction, Double Error Detection SMI I ntel Scalable Mem ory I nt erconnect TD P TLB Therm al Design Power Translat ion Look- aside Buffer 22

Transitioning the I ntel Next. ( Nehalem and W estm ere) into the. Lily Looi, St ephan Jourdan I ntel Corporation Hot Chips 21 August 24, 2009

Transitioning the I ntel Next. ( Nehalem and W estm ere) into the. Lily Looi, St ephan Jourdan I ntel Corporation Hot Chips 21 August 24, 2009 Transitioning the I ntel Next Generation Microarchitectures ( Nehalem and W estm ere) into the Mainstream Lily Looi, St ephan Jourdan I ntel Corporation Hot Chips 21 August 24, 2009 1 Agenda Next Generation