New Dimensions in Microarchitecture Harnessing 3D Integration Technologies

Kerry Bernstein
IBM T.J. Watson Research Center, Yorktown Heights, NY
DARPA Microsystems Technology Symposium, San Jose, California, 6 March 2007
Cover image: "Escher Envy", courtesy of David Bryant

Report Documentation Page (Standard Form 298, Rev. 8-98; Form Approved OMB No. 0704-0188)
- Report date: 06 MAR 2007
- Report type: N/A
- Title and subtitle: New Dimensions in Microarchitecture Harnessing 3D Integration Technologies
- Performing organization: IBM Corporation
- Distribution/availability statement: Approved for public release, distribution unlimited
- Supplementary notes: DARPA Microsystems Technology Symposium held in San Jose, California on March 5-7, 2007. Presentations. The original document contains color images.
- Security classification: report unclassified; abstract unclassified; this page unclassified
- Limitation of abstract: UU
- Number of pages: 12

Server Trends
- Frequency is no longer increasing.
- Logic speed scaled faster than the memory bus; the (processor clock / bus clock) ratio consumes bandwidth.
- More speculation and more attempts to prefetch; wrong guesses increase miss traffic.
- Shortening the line size is limited by the directory as cache size grows, but doubling the line size doubles bus occupancy.
- Cores per die are increasing each generation, which multiplies off-chip bus transactions by N / 2*Sqrt(2).
- More threads per core and an increase in virtualization multiply off-chip bus transactions by N.
- Processors per SMP are increasing, which aggravates queueing throughout the system.
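The line-size and clock-ratio points can be made concrete with a minimal sketch. It assumes a simple bus that moves a fixed number of bytes per bus cycle; the function name `bus_occupancy_cpu_cycles` and all parameter values are hypothetical and are not taken from the slides.

```python
import math


def bus_occupancy_cpu_cycles(line_bytes: int,
                             bus_bytes_per_beat: int,
                             cpu_to_bus_clock_ratio: float) -> float:
    """Processor cycles the bus is busy while moving one cache line.

    Assumption (not from the slides): the bus transfers a fixed number of
    bytes per bus cycle, and one bus cycle lasts `cpu_to_bus_clock_ratio`
    processor cycles.
    """
    beats = math.ceil(line_bytes / bus_bytes_per_beat)
    return beats * cpu_to_bus_clock_ratio


if __name__ == "__main__":
    # Hypothetical numbers: 16-byte bus, processor clock 4x the bus clock.
    for line in (64, 128, 256):
        print(line, bus_occupancy_cpu_cycles(line, 16, 4.0))
```

Under these assumptions, doubling the line size doubles the occupancy, and a faster processor clock relative to the bus raises the cost of every transfer when measured in processor cycles.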

Without more bandwidth at low latencies, core counts on chip will saturate
[Figure: performance (A.U.) and number of cores per processor versus year, with curves for device technology performance, interconnect technology performance, and system performance; markers for 2007, the 2D core bandwidth limit (max core count in 2D), and "Max Core Count In 3D?". From "New Dimensions in Microarchitecture", K. Bernstein, MICRO-39, 12/06.]
3D extends transfer of performance from the device to the core level.

Chip-Package Technology Gap
[Figure: design rule pitch (µm, log scale) versus year. Package series: wire bond and flip chip, each from the 2001 and 2005 ITRS. Multi-level wiring series: global (minimum), intermediate, and local (metal 1 wiring), each from the 2001 and 2005 ITRS. The space between the package and wiring series is labeled "Technology Gap". Comparison of trends adapted from the 2001 ITRS road map to data points from the 2005 ITRS road map.]
Technology gap in the design rule between on-chip wiring and packaging interconnects.

POWER Series FP Performance
[Figure: "POWER Series Architectural Perf Contributions": SpecFP (thousands, 0 to 4) for POWER3, POWER4, POWER4+, POWER5, and POWER5+, annotated with the contributing techniques: superscalar, clock gating, speculative execution, branch prediction, multi-core processor, deep pipelining, out-of-order execution, register renaming, 64-bit dataflow, symmetric multi-threading (P5), and bandwidth. Further labels: "Processor Perf", "Data Delivery", "Transaction Rate Dependence".]

Components of Processor Performance
From ISCA 06 keynote address by Phil Emma, IBM
[Figure, left: uniprocessor core (I unit, E unit, L1 cache) and the cache and memory hierarchy (L2 cache); L1 size is limited by cycle time.]
[Figure, right: CPI versus miss rate. The intercept is the infinite-cache performance (E busy plus E not busy); the slope is the miss penalty (cycles per miss); the rise above the intercept is the finite cache effect.]
- MIPS = f / CPI
- Delay is sequentially determined by a) the ideal processor, b) access to the local cache, and c) refill of the cache.
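The CPI-versus-miss-rate line on this slide can be written out as a short sketch: an infinite-cache intercept plus a finite cache effect whose slope is the miss penalty, followed by MIPS = f / CPI. The function names and the numbers in the example are illustrative assumptions, not values from the keynote.

```python
def cpi(cpi_infinite_cache: float,
        misses_per_instruction: float,
        miss_penalty_cycles: float) -> float:
    """CPI = infinite-cache CPI + finite cache effect.

    The finite cache effect is the miss rate times the miss penalty,
    i.e. the slope of the CPI-versus-miss-rate line.
    """
    return cpi_infinite_cache + misses_per_instruction * miss_penalty_cycles


def mips(frequency_hz: float, cpi_value: float) -> float:
    """MIPS = f / CPI, expressed in millions of instructions per second."""
    return frequency_hz / cpi_value / 1e6


if __name__ == "__main__":
    # Hypothetical workload: CPI of 1.0 with an infinite cache,
    # 2 misses per 100 instructions, 100-cycle miss penalty, 3 GHz clock.
    c = cpi(1.0, 0.02, 100.0)
    print(c, mips(3e9, c))
```

In this hypothetical case the finite cache effect is twice the infinite-cache CPI, so improving data delivery helps more than speeding up the core.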

Queueing Effects vs. Log Miss Rate
From ISCA 06 keynote address by Phil Emma, IBM
[Figure: relative performance and bus utilization (0 to 1) versus log2(instructions per miss), from 10 down to -3, for CPI = 3.63, with one relative-performance curve and one bus-utilization curve each for TE = 128, 32, 8, and 2.]

What Is Bandwidth Used For?
In a computer, it is mostly for handling cache misses.
From ISCA 06 keynote address by Phil Emma, IBM
[Figure: timeline of bus events and processor events for one miss. Leading edge: from the miss, through the access, to the first data. Trailing edge: the rest of the cache line, up to the last data.]
Miss Penalty = Leading Edge + Effects(Trailing Edge)
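The miss-penalty relation can be sketched as follows. The slide does not define Effects(), so this sketch assumes, purely for illustration, that a fixed fraction of the trailing edge is exposed to the processor; `exposed_fraction` is a hypothetical parameter.

```python
def miss_penalty(leading_edge_cycles: float,
                 trailing_edge_cycles: float,
                 exposed_fraction: float) -> float:
    """Miss Penalty = Leading Edge + Effects(Trailing Edge).

    Assumption (not from the slide): Effects() is modeled as a constant
    fraction of the trailing edge that stalls the processor. A value of
    0.0 means the processor resumes on first data and never waits for
    the rest of the line; 1.0 means it waits for the whole line.
    """
    if not 0.0 <= exposed_fraction <= 1.0:
        raise ValueError("exposed_fraction must be between 0 and 1")
    return leading_edge_cycles + exposed_fraction * trailing_edge_cycles


if __name__ == "__main__":
    # Hypothetical numbers: 100-cycle leading edge, 32-cycle trailing edge.
    print(miss_penalty(100.0, 32.0, 0.25))
```

The leading edge is a latency term, while the trailing edge depends on line size and bus bandwidth, which is why both latency and bandwidth appear in the miss penalty.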

3D - Bandwidth and Latency
From "New Dimensions in Microarchitecture", K. Bernstein, MICRO-39, 12/06
Processor load trade-off between I/O bandwidth and bus latency.
- For generic workloads, uni-processor performance saturates the bandwidth benefit and becomes latency-limited.
- As core counts increase, I/O bandwidth becomes increasingly important.
[Figure: "Bandwidth and Latency Boundaries, General Purpose Processor Loads": architectural performance, TpCC (normalized, 1 to 4), versus bandwidth, GB/sec (normalized, 1 to 4), for single-core, double-core, and quad-core processors, with bandwidth-limited and latency-limited regions marked.]
3D is an opportunity to improve high-performance compute throughput by sustaining a higher number of cores per chip.
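The two boundaries can be illustrated with a minimal sketch that takes the smaller of a latency bound and a bandwidth bound. The hard minimum, the function names, and the normalized numbers are simplifying assumptions for illustration; the curves in the source figure are not reproduced here.

```python
def throughput(cores: int,
               per_core_latency_limited_rate: float,
               bandwidth: float,
               bandwidth_per_unit_rate: float) -> float:
    """Normalized throughput bounded by latency and by bandwidth.

    Assumptions (not from the slide): each core delivers a fixed rate
    once it is latency-limited, and every unit of throughput consumes a
    fixed amount of I/O bandwidth.
    """
    latency_bound = cores * per_core_latency_limited_rate
    bandwidth_bound = bandwidth / bandwidth_per_unit_rate
    return min(latency_bound, bandwidth_bound)


def regime(cores: int,
           per_core_latency_limited_rate: float,
           bandwidth: float,
           bandwidth_per_unit_rate: float) -> str:
    """Name the boundary that currently limits throughput."""
    latency_bound = cores * per_core_latency_limited_rate
    bandwidth_bound = bandwidth / bandwidth_per_unit_rate
    return "latency limited" if latency_bound <= bandwidth_bound else "bandwidth limited"


if __name__ == "__main__":
    for n in (1, 2, 4):
        for bw in (1.0, 2.0, 3.0, 4.0):
            print(n, bw, throughput(n, 1.0, bw, 1.0), regime(n, 1.0, bw, 1.0))
```

With these hypothetical units a single core stops benefiting from extra bandwidth almost immediately, while four cores remain bandwidth limited across most of the range.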

Low Vdd Technology and Parallelism
- Energy optimum for fixed performance as a function of Vdd, VT, and the effectiveness of parallelism.
- α determines the device (circuit) overhead needed to maintain constant performance through parallelism.
- α = 1: no overhead; half the speed, double the devices.
- α > 1: increasing overhead; passive power becomes dominant.
- At constant performance: $d / d_o = (N / N_o)^{1/\alpha}$, where N = number of circuits, d = circuit delay, N_o = number of circuits at 1 V, and d_o = circuit delay at 1 V.
[Figure: "Energy per Transaction at Constant Performance" (2 Stages, Activity=1%): total energy per transaction (0 to 1) versus Vdd (0 to 1), with curves for α = 1, α = 1.2, and α = 2.]
From "3D Integration Special Topic Session", W. Haensch, ISSCC 07, 2/07
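The role of α can be sketched numerically. The relation used here, N / N_o = (d / d_o)^α, is a reading of the slide's formula that matches its statement that α = 1 means half the speed and double the devices; treat it as an assumption, and the function name as hypothetical.

```python
def device_count_ratio(delay_ratio: float, alpha: float) -> float:
    """Return N / N_o needed to hold performance constant.

    delay_ratio is d / d_o, the circuit slowdown relative to 1 V.
    alpha is the parallelism overhead exponent: 1 means no overhead,
    values above 1 mean extra devices beyond the ideal.
    Assumed relation: N / N_o = (d / d_o) ** alpha.
    """
    if delay_ratio <= 0:
        raise ValueError("delay_ratio must be positive")
    return delay_ratio ** alpha


if __name__ == "__main__":
    # Circuits running at half speed (d / d_o = 2) at a lower Vdd.
    for a in (1.0, 1.2, 2.0):
        print(a, device_count_ratio(2.0, a))
```

Because passive power scales with device count, the extra devices required when α > 1 are one way to see why passive power becomes dominant as overhead grows.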

Two Classes of 3DI Processes at IBM
- SOI-3D: SOI top layer. Advantage: smallest 3D vias.
- Bulk-3D: bulk top layer. Advantage: broader foundry compatibility.
[Figure: cross-sections of the SOI wafer and bulk wafer stacks, with dimension labels ~0.2 um, ~6 um, ~1.4-2.0 um, and ~5 um.]

Summary
- µP architecture tricks to avoid atomistic, QM scaling boundaries overwhelm present interconnect technologies.
- Integration into the Z-plane again postpones interconnect-related limitations to extending classic scaling.
- No aspect of architecture or technology remains 2D, so why even view chips as being monolithic anymore?
- Transaction retirement rate dependence on data delivery is increasing; dependence on µP performance and CMOS device speed is decreasing.
- 3D integration improves storage density and access to that storage.
- The last remaining opportunity in CMOS to save power is in the delivery of data rather than in its generation.