On-Die Interconnects for next generation CMPs


1 On-Die Interconnects for Next-Generation CMPs. Partha Kundu, Corporate Technology Group (MTL), Intel Corporation. OCIN Workshop, Stanford University, December 6,


2 Multi-core Transition Accelerating. "We notified customers we're pulling in both the desktop and server (launch) of the first quad-core processors into the fourth quarter of this year from the first half of 2007." "The UltraSPARC T1 processor with CoolThreads technology is the highest-throughput and most eco-responsible processor ever created." "Azul has been able to pack an industry-leading 24 processor cores on a single chip, which means that each processor is able to run 24 simultaneous parallel threads." *Third-party marks and brands are the property of their respective owners.

3 What will we do with this Compute Power? Emerging killer applications: Recognition, Mining, Synthesis (the RMS suite). Source: "Cool Codes for Hot Chips", keynote by Justin Rattner, CTO, Intel, Aug

4 Tera-Scale Prototype: scalable on-die fabric, fixed-function units, last-level cache, high-BW memory I/F. Source: "Cool Codes for Hot Chips", keynote by Justin Rattner, CTO, Intel, Aug. 2006

5 Overview of Talk: establish the importance of on-die interconnects; walk through a case study of a router design; evaluate against goals; conclusions.

6 RMS Data Size Estimates. [Figure: miss ratio (% of memory accesses) versus cache size (MB) for ShotDetect, Videomining, PageRank, Structure Learning (SNP), MultiDocument Summary, Frequent ItemSet Mining, ADAt, FB_Estimation, SparseMVM, SparseMVM_sym, SparseMVM_trans, Dense_mmm, Dense_mvm, Dense_mvm_sym, SVM_RFE (new), IPM, BodyTracker, PCG, Solids, Springs, Gauss-Seidel.] Annotations: "Primarily running at off-die B/W"; "On-chip caching is effective for these apps". * Data collected on a complete application run on a hardware cache emulator.

7 [Diagram: four CPUs, each with a private cache, in two cache organizations.] Shared organization: no data replication; all data goes over the on-die interconnect; high on-die B/W, low off-die B/W. Private organization: possible data replication; primarily dirty blocks go over the on-die interconnect; low on-die B/W, high off-die B/W.

8 Sharing exists in some of the RMS kernels. [Figure: on-die bandwidth and off-die bandwidth, shared versus private caches, for binomial, som, svd, gauss, pcg, mmm, svm, kmeans.] Shared: high on-die B/W, low off-die B/W. Private: low on-die B/W, high off-die B/W. Manage off-die bandwidth via a better on-die network.

9 Need for Scalability. Bandwidth components: data 74%, protocol 15%, flow control and error 11%. Bandwidth growth over time: data grows with cores; protocol grows faster than cores; error is growing due to process.


10 Need for Scalability: 2D Mesh for CMP. Bandwidth components: data 74%, protocol 15%, flow control and error 11%. Bandwidth growth over time: data grows with cores; protocol grows faster than cores; error is growing due to process. Need a scalable network. Network parameters: size: 6x6 mesh; link sizing: 16 B, >3 GHz; traffic classes: request, response, data; data block size: 64 bytes; switching and flow control: wormhole with VC flow control; error control: end-to-end.
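The network parameters on this slide imply a few derived figures that are easy to check by hand. The sketch below is an illustration rather than anything from the talk: it takes the mesh size, link width, and block size from the slide, assumes 3.0 GHz as a lower bound for the ">3 GHz" link clock, ignores any header flit, and uses hypothetical function names.

```python
from itertools import product

# Values taken from the network-parameters list on the slide.
MESH_DIM = 6            # 6x6 mesh
LINK_BYTES = 16         # link width in bytes
DATA_BLOCK_BYTES = 64   # data block size in bytes

# Assumption: the slide only says ">3 GHz", so 3.0 GHz is a lower bound.
LINK_GHZ = 3.0


def flits_per_block(block_bytes: int = DATA_BLOCK_BYTES,
                    link_bytes: int = LINK_BYTES) -> int:
    """Link-wide flits needed for one data block (payload only, no header)."""
    return -(-block_bytes // link_bytes)  # ceiling division


def link_bandwidth_gbs(link_bytes: int = LINK_BYTES,
                       ghz: float = LINK_GHZ) -> float:
    """Peak bandwidth of one unidirectional link in GB/s."""
    return link_bytes * ghz


def bisection_links(k: int = MESH_DIM) -> int:
    """Unidirectional links cut when a k x k mesh (k even) is split in half."""
    return 2 * k


def mean_hops(k: int = MESH_DIM) -> float:
    """Mean Manhattan distance over all ordered pairs of distinct nodes."""
    nodes = list(product(range(k), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in nodes for (bx, by) in nodes)
    return total / (len(nodes) * (len(nodes) - 1))


if __name__ == "__main__":
    print("flits per 64-byte block:", flits_per_block())
    print("per-link bandwidth (GB/s):", link_bandwidth_gbs())
    print("bisection links:", bisection_links())
    print("mean hop count:", mean_hops())
```

Under these assumptions a 64-byte block occupies 4 flits, each link carries at least 48 GB/s, 12 unidirectional links cross the bisection, and a uniformly random source/destination pair is 4 hops apart on average.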

11 Case Study of a Router

12 5-port Switch (overview). Power breakdown: buffers 46%, crossbar 35%, clock buffer 16%, arb 3%. Router area: crossbar 54%, misc 31%, buffers 15%. Design/uArchitecture goals: reduce crossbar area (and power); reduce buffer power; maximize throughput of network.

13 Double-pumped Crossbar. Potential reduction: channel width 50%, channel area 25%, channel power 17%, channel delay 17%. Source: Vangal et al., "A six-port 57GB/s double pumped nonblocking router core", Symp. on VLSI Circuits, June 2005.

14 Buffer Management. [Figure: fraction of network capacity (0% to 100%) versus addition of flit buffers.]

15 Buffer Management. [Figure: fraction of network capacity (0% to 100%) versus addition of flit buffers.] Statically assigned buffers: SAMQ with simple (VCT) flow control.

16 Buffer Management. [Figure: fraction of network capacity (0% to 100%) versus flit buffers per input port, comparing SAMQ with VCT flow control against DAMQ with wormhole/VC flow control.] Statically assigned buffers: SAMQ with VCT flow control. Dynamically assigned buffers: DAMQ, wormhole with virtual channel flow control. [Diagram: header control block (packet tracker) holding VCi block info and VCo block info with pointers Ptr0 to Ptr4; payload buffer holding flits F0 to F4; buffer read.] Achieve high throughput at low(er) power/area.
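To make the contrast between statically and dynamically assigned buffers concrete, here is a minimal sketch of the two allocation schemes. It is a generic textbook-style model, not the design shown on the slide: the class names are hypothetical, the flow-control protocols (VCT, wormhole/VC) are not modelled, and the per-VC linked list only loosely mirrors the pointer-based packet tracker in the diagram.

```python
class StaticBuffer:
    """SAMQ-style: each virtual channel owns a fixed slice of the flit slots."""

    def __init__(self, num_vcs: int, total_slots: int) -> None:
        self.per_vc = total_slots // num_vcs
        self.queues = [[] for _ in range(num_vcs)]

    def push(self, vc: int, flit) -> bool:
        if len(self.queues[vc]) >= self.per_vc:
            return False  # this VC's slice is full, even if others are empty
        self.queues[vc].append(flit)
        return True

    def pop(self, vc: int):
        return self.queues[vc].pop(0) if self.queues[vc] else None


class DynamicBuffer:
    """DAMQ-style: one shared slot pool, chained per VC through next pointers."""

    def __init__(self, num_vcs: int, total_slots: int) -> None:
        self.payload = [None] * total_slots
        # Initially every slot is on the free list: 0 -> 1 -> ... -> -1.
        self.next_ptr = [i + 1 for i in range(total_slots)]
        self.next_ptr[-1] = -1
        self.free_head = 0
        self.head = [-1] * num_vcs
        self.tail = [-1] * num_vcs

    def push(self, vc: int, flit) -> bool:
        if self.free_head == -1:
            return False  # the whole pool is full
        slot = self.free_head
        self.free_head = self.next_ptr[slot]
        self.payload[slot] = flit
        self.next_ptr[slot] = -1
        if self.tail[vc] == -1:
            self.head[vc] = slot
        else:
            self.next_ptr[self.tail[vc]] = slot
        self.tail[vc] = slot
        return True

    def pop(self, vc: int):
        slot = self.head[vc]
        if slot == -1:
            return None
        flit = self.payload[slot]
        self.head[vc] = self.next_ptr[slot]
        if self.head[vc] == -1:
            self.tail[vc] = -1
        # Return the slot to the front of the free list.
        self.payload[slot] = None
        self.next_ptr[slot] = self.free_head
        self.free_head = slot
        return flit


if __name__ == "__main__":
    static, dynamic = StaticBuffer(4, 8), DynamicBuffer(4, 8)
    print("static accepted:", sum(static.push(0, i) for i in range(5)))
    print("dynamic accepted:", sum(dynamic.push(0, i) for i in range(5)))
```

With 8 slots shared by 4 virtual channels, a burst of 5 flits on one channel fits entirely in the dynamic pool, while the static scheme accepts only its 2-slot share. That is the intuition behind reaching a given throughput with fewer flit buffers.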

17 Switch Allocator. [Figure: latency (cycles) versus load as a fraction of capacity for PIM1, SPAA, SPARO, and Perfect (Ford-Fulkerson).] Need to generate 4 requests per cycle. Adapts to load conditions using a heuristic. Proprietary switch allocator achieves high matching efficiency. Achieve high throughput at manageable latency.
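Two of the allocators named in the plot have well-known generic forms, sketched below for intuition: a single iteration of parallel iterative matching (PIM1) and a maximum bipartite matching found by augmenting paths, which is what Ford-Fulkerson reduces to on a unit-capacity request graph. This is an assumed, simplified model with hypothetical function names. SPAA and SPARO are not modelled, since the slide gives no detail on them.

```python
import random


def pim1(requests, rng):
    """One iteration of parallel iterative matching.

    requests[i] is the set of output ports requested by input port i.
    Returns a dict mapping matched inputs to outputs.
    """
    # Grant phase: every requested output grants one requester at random.
    grants = {}  # input -> list of outputs that granted it
    for out in sorted({o for req in requests for o in req}):
        candidates = [i for i, req in enumerate(requests) if out in req]
        grants.setdefault(rng.choice(candidates), []).append(out)
    # Accept phase: every input accepts one of its grants at random.
    return {i: rng.choice(outs) for i, outs in grants.items()}


def maximum_matching(requests):
    """Maximum bipartite matching via augmenting paths."""
    owner = {}  # output -> input currently holding it

    def augment(i, seen):
        for out in sorted(requests[i]):
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or push its current owner elsewhere.
            if out not in owner or augment(owner[out], seen):
                owner[out] = i
                return True
        return False

    for i in range(len(requests)):
        augment(i, set())
    return {i: out for out, i in owner.items()}


if __name__ == "__main__":
    # Hypothetical request pattern for a 5-port switch.
    reqs = [{0, 1}, {0}, {1, 2}, {2, 3}, {3}]
    print("PIM1:", pim1(reqs, random.Random(0)))
    print("maximum:", maximum_matching(reqs))
```

A single PIM iteration can leave ports idle that a maximum matching would use. Closing that gap without paying for an exact matching every cycle is the design space a heuristic allocator works in.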

18 Pipeline Design. [Figure: fraction of cycle time (0% to 120%) for the Request Setup, Switch Alloc, Buffer Read, and Crossbar Traversal stages.] 4-stage pipeline. Buffer read is not in parallel with switch arbitration. Crossbar traversal sets the cycle time.

19 Pipeline Design. Base pipeline. Choose pipeline frequency to maximize switching rate. Optimize for load conditions. [Diagram stages: Request Set Up, Crossbar Traversal.]
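A rough zero-load latency estimate shows how pipeline depth feeds into end-to-end delay. The sketch below uses the common approximation in which the head flit pays the full router pipeline at every hop and the remaining flits follow one per cycle. The 4-stage depth comes from the slides. The one-cycle link traversal, the hop count in the example, and the function name are assumptions.

```python
ROUTER_STAGES = 4  # from the slides: 4-stage router pipeline
LINK_CYCLES = 1    # assumption: one cycle to traverse each link


def zero_load_latency_cycles(hops: int, flits: int,
                             router_stages: int = ROUTER_STAGES,
                             link_cycles: int = LINK_CYCLES) -> int:
    """Cycles until the tail flit arrives in an otherwise idle network.

    hops counts router-plus-link traversals on the path; flits is the
    packet length in flits.
    """
    head_latency = hops * (router_stages + link_cycles)
    serialization = flits - 1  # body and tail flits trail the head
    return head_latency + serialization


if __name__ == "__main__":
    # Hypothetical example: a 4-flit packet (64 bytes over 16-byte links)
    # travelling 4 hops.
    print(zero_load_latency_cycles(hops=4, flits=4))
```

For that example the estimate is 4 x 5 + 3 = 23 cycles. Removing a pipeline stage would save one cycle per hop, provided the cycle time itself does not stretch.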

20 Power Challenges for ODI. With a dense compute unit: router + link power 20%, dense compute unit 80%. With a 256KB cache: router + link power 36%, cache 64%. Interconnect power split: router power 82%, links power 18%. Interconnect power is currently exceeding budget! 8 units of power overhead per unit of bit transferred.

21 Miscellaneous Issues. Increased soft errors and process variability impact design: design to detect and/or correct errors (latency, bandwidth impact); routing for fault tolerance. Clocking power is high (16%). With wide links, the cost of GALS approaches may be higher.

22 Conclusions. A scalable, high-performance on-die interconnect will be required in future CMPs. We do achieve high network throughput. Many of the techniques are borrowed from previous research. But a significant challenge is to fit within power and area.

23 Acknowledgments. Co-leads: Jay Jayasimha, Yatin Hoskote. Aniruddha Vaidya, Sriram Vangal, Arvind P. Singh, Chris Hughes, Y-K Chen, Ioannis Schoinas, Akhilesh Kumar, Sailesh Kottapalli, Jeffrey Chamberlain, Li-Shiuan Peh, Amit Kumar, Niraj Jha.
