Concurrent High Performance Processor design: From Logic to PD in Parallel

1 Concurrent High Performance design: From Logic to PD in Parallel
Leon Stok, VP EDA, IBM Systems Group

2 The mainframe is everywhere, making the world work better
- Mainframes process 30 billion business transactions per day
- Mainframes enable $6 trillion in card payments annually
- 80 percent of the world's corporate data resides or originates on mainframes
- 91 percent of CIOs said new customer-facing apps are accessing the mainframe
2017 IBM Corporation

3 IBM Z Roadmap
- z10, 2/2008, 65 nm: Workload Consolidation and Integration Engine for CPU Intensive Workloads; Decimal FP; Infiniband; 64 CP Image; Large Pages; Shared
- z196, 9/2010, 45 nm: Top Tier Single Thread Performance, System Capacity; Integration; Out of Order Execution; Water Cooling; PCIe I/O Fabric; RAIM; Enhanced Energy Management
- zEC12, 9/2012, 32 nm: Leadership Single Thread, Enhanced Throughput; Improved out-of-order; Transactional; Dynamic Optimization; 2 GB page support; Step Function in System Capacity
- z13, 3/2015, 22 nm: Leadership System Capacity and Performance; Modularity & Scalability; Dynamic SMT; Supports two instruction threads; SIMD; PCIe attached accelerators; Business Analytics Optimized
- z14, 9/2017, 14 nm: Pervasive encryption; Low latency I/O for acceleration of transaction processing for DB2 on z/OS; Pauseless garbage collection for enterprise scale Java applications; New SIMD instructions; Optimized pipeline and enhanced SMT; Virtual Flash

4 z14 processor design summary
Micro-Architecture
- 10 cores per CP-chip
- 5.2GHz
- Cache Improvements: 128KB I$ + 128KB D$; 2x larger L2 D$ (4MB); 2x larger L3 Cache; symbol ECC
- New translation & TLB design: Logical-tagged L1 directory; Pipelined 2nd level TLB; Multiple translation engines
Architecture
- PauseLess Garbage Collection
- Vector Single & Quad precision
- Long-multiply support (RSA, ECC)
- Register-to-register BCD arithmetic
- Redesigned in-core crypto-accelerator: Improved performance; New functions (GCM, TRNG, SHA3)
- Optimized in-core compression accelerator: Improved start/stop latency; Huffman encoding for better compression ratio; Order-preserving compression
Better Branch Prediction
- 33% Larger BTB1 & BTB2
- New Perceptron & Simple Call/Return Predictor
Pipeline Optimizations
- Improved instruction delivery
- Faster branch wakeup
- Improved store hazard avoidance
- 2x double-precision FPU bandwidth
- Optimized 2nd generation SMT2

5 shrinkage in 14nm
- 33% area reduction
- Timing within ~-5ps range (FOMs ~-2500)
- ~40% less logic gate width, ~20% less total gate width
- LVT width at least as good; some versions show improvement to significant improvement

6 Why was this so difficult?
Logic designers from Venus, PD designers from Mars
- Logical Organization Preference: Verification Focus, Logic Ownership, Functional Adjacency
- Physical Organization Preference: Implementation Focus, Physical Optimization, Geographic Adjacency
- Combined Single Hierarchy: Iterative PD Annotation, High Coordination Effort, Less Efficient
- Design Quality: Performance, Power, Area
[Figure: a logical hierarchy (C, C2, C3, C4; B, B2, B3, B4) regrouped into a physical hierarchy that adds A, A2, A3]

7 Logic Designers View
- An obvious benefit is to create a multi-core chiplet
- Create a multi-core chiplet entity and instantiate it multiple times
- Move processor cores and bus interface logic into their respective multi-core chiplet instances
[Figure labels: four Multi-core Chiplet instances, On-chip Peripheral, On-chip Bus/Interconnect]
[Alvan Ng, Automated Physical Hierarchy Generation: Tools and Methodology, DVCon 2018]

8 Create Integration Chiplets For Manageability
- The physical blocks are reshaped to fit into the physical chiplets
- North and South chiplets and a Bus chiplet are good choices
- Create the chiplet entities and move the selected logic into their instances
[Figure labels: North Chiplet, Bus Chiplet, South Chiplet, On-chip Bus/Interconnect, On-chip Peripheral]

9 Multi-core Chip Physical Floorplan
- Quad-core chiplet instantiated 4 times
- Center stripe bus chiplet with 2x high speed link, small accelerator, and the on-chip controller
- Top chiplet contains memory controller, 2 small accelerators, and 2 medium accelerators
- Bottom chiplet contains memory controller and 2 large accelerators
- Stack the remaining circuitry in the open spaces at the top

10 Alternative Chip Physical Floorplan
- Quad-core chiplet instantiated 4 times
- Center stripe bus chiplet contains 2x -Peripheral combined unit, 3x small accelerator, and the on-chip controller
- One accelerator chiplet, containing a large and a medium accelerator, instantiated twice
- Stack the High-Speed Links on the right
[Figure labels: On-chip Bus/Interconnect, MEM/IO]

11 Morph: RTL to RTL morphing
- Logical Hierarchy and Recipe Files feed Morph-Hier
- Morph-Hier produces the Physical Hierarchy and a Hierarchy Mapping Database
- Equivalency Checking
- Recipe: Instance move; Port optimization; Pin Cloning; Subway Creation; Scheduler Statement reordering for consistency
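The "instance move" recipe step and the hierarchy mapping database can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Module` class, the `("move", old_path, new_parent_path)` recipe format, and the block names are hypothetical and are not Morph-Hier's actual input format or API. The leaf-set comparison is a crude stand-in for real equivalency checking.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """A hierarchy node: a named module holding child instances."""
    name: str
    children: dict = field(default_factory=dict)

    def add(self, child: "Module") -> "Module":
        self.children[child.name] = child
        return child


def find_parent(root: Module, path: str) -> Module:
    """Return the module that owns the last component of a '/'-separated path."""
    node = root
    for part in path.split("/")[:-1]:
        node = node.children[part]
    return node


def apply_recipe(root: Module, recipe: list) -> dict:
    """Apply (hypothetical) 'move' steps and return an old-path -> new-path map."""
    mapping = {}
    for op, old_path, new_parent_path in recipe:
        if op != "move":
            raise ValueError(f"unsupported recipe step: {op}")
        leaf = old_path.split("/")[-1]
        inst = find_parent(root, old_path).children.pop(leaf)
        # Walk to the destination, creating physical containers on demand.
        dest = root
        for part in new_parent_path.split("/"):
            dest = dest.children.setdefault(part, Module(part))
        dest.children[leaf] = inst
        mapping[old_path] = f"{new_parent_path}/{leaf}"
    return mapping


def leaves(node: Module) -> set:
    """Collect the names of all leaf instances below a node."""
    if not node.children:
        return {node.name}
    found = set()
    for child in node.children.values():
        found |= leaves(child)
    return found


# Hypothetical logical hierarchy: units B and C with four blocks each.
chip = Module("chip")
for unit in ("B", "C"):
    parent = chip.add(Module(unit))
    for i in range(1, 5):
        parent.add(Module(f"{unit}{i}"))

leaves_before = leaves(chip)
# Hypothetical recipe: regroup C4 and B1 into a new physical block A.
mapping = apply_recipe(chip, [("move", "C/C4", "A"), ("move", "B/B1", "A")])
leaves_after = leaves(chip)
```

The returned `mapping` plays the role of the hierarchy mapping database: it lets a logic designer keep referring to `C/C4` while the physical view places it under `A`. Comparing `leaves_before` with `leaves_after` only confirms that no instance was lost or duplicated by the moves.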

12 IoT Design Automation Tools: Aspect Oriented Design
Significant design content exists to support non-mainline functionality. This impacts the ability to readily reuse design IP and hinders productivity by forcing designers to include such concerns while implementing core functionality.
Need a design system that fully separates the insertion of non-mainline aspects from the core functional description.
- Test: Scan, BIST, SCOM, Test Points
- RAS: Error Detection, Correction, Recovery
- Trace & Debug...
- Power Management: Clock Gating, Power Gating, Fencing, Sensors, Dynamic Control
[Figure labels: Functional Description, Content Weaver, Full Design Content]
Design Automation in the Era of AI and IoT, Arvind Krishna, IEEE/ACM DATE Conference, March 28, 2017
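The separation the slide asks for can be pictured as a weaving step that takes a pure functional description and layers non-mainline aspects on top of it. The sketch below is illustrative only: the `weave` function, the aspect functions, and the unit and content names are all hypothetical, and it makes no claim about how the Content Weaver on the slide is implemented.

```python
from typing import Callable, Dict, List

# Hypothetical model: unit name -> list of content items in that unit.
Design = Dict[str, List[str]]
Aspect = Callable[[str, List[str]], List[str]]


def add_scan(unit: str, content: List[str]) -> List[str]:
    """Test aspect: append a scan chain to the unit (illustrative name)."""
    return content + [f"{unit}_scan_chain"]


def add_clock_gating(unit: str, content: List[str]) -> List[str]:
    """Power management aspect: append a clock gate (illustrative name)."""
    return content + [f"{unit}_clock_gate"]


def weave(functional: Design, aspects: List[Aspect]) -> Design:
    """Return full design content without modifying the functional description."""
    full: Design = {}
    for unit, content in functional.items():
        woven = list(content)  # copy, so the functional view stays clean
        for aspect in aspects:
            woven = aspect(unit, woven)
        full[unit] = woven
    return full


functional = {"fxu": ["adder"], "lsu": ["load_queue"]}
full = weave(functional, [add_scan, add_clock_gating])
```

The point of the design choice is that `functional` is never edited: the same functional description can be rewoven with a different aspect list, which is the reuse benefit the slide describes.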

13 Morph: RTL to RTL morphing
- Aspects, Logical Hierarchy, and Recipe Files feed Morph-Hier
- Morph-Hier produces the Physical Hierarchy and a Hierarchy Mapping Database
- Equivalency Checking
- Recipe: Instance move; Port optimization; Pin Cloning; Subway Creation; Scheduler Statement reordering for consistency

14 Pervasive Logic Centralized VHDL Organization
[Figure labels: Peripheral, On-chip Bus/Interconnect, On-chip Logic, Test Logic, Miscellaneous Circuitries]

15 Distribute Pervasive Logic Using Morph-Hier
- The red dots graphically map a pervasive boundary
- Each pervasive logic unit contains all the supporting logic for each functional unit
- Pervasive logic is pushed into the physical entities using Morph-Hier
[Figure labels: Peripheral, On-chip Bus/Interconnect]

16 Centralized Pervasive Logic Distributed To Physical Units
Benefits:
- Parallel logic design: concurrent with functional units
- Verification speedup: self-contained unit
- Design quality: lower bug rate
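The move from a centralized pervasive unit to per-unit pervasive logic amounts to regrouping support logic by the physical unit that hosts each functional unit. A minimal sketch follows, assuming a hypothetical `placement` table and made-up unit names; it does not reflect the actual VHDL organization or the Morph-Hier recipe used for this step.

```python
from collections import defaultdict
from typing import Dict, List


def distribute_pervasive(
    central: Dict[str, List[str]],
    placement: Dict[str, str],
) -> Dict[str, List[str]]:
    """Regroup centralized pervasive logic by physical unit.

    central:   functional unit -> pervasive items that support it
    placement: functional unit -> physical unit hosting it (hypothetical table)
    """
    per_physical = defaultdict(list)
    for unit, items in central.items():
        physical = placement[unit]  # KeyError flags a unit with no physical home
        for item in items:
            per_physical[physical].append(f"{unit}.{item}")
    return dict(per_physical)


# Illustrative data only; names are assumptions, not taken from the design.
central = {
    "core0": ["scan", "clock_ctrl"],
    "bus": ["scan"],
    "memctl": ["scan", "bist"],
}
placement = {
    "core0": "quad_core_chiplet",
    "bus": "bus_chiplet",
    "memctl": "north_chiplet",
}
distributed = distribute_pervasive(central, placement)
```

Each entry of `distributed` is self-contained, which is what allows the pervasive logic of a physical unit to be designed and verified alongside that unit rather than in one central block.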

17 z14 Pipeline
- Deep high frequency pipeline
- Async branch prediction ahead of ifetch
- 32B/cycle ifetch
- 6 instructions/cycle parse & decode
- CISC instruction cracking
- Unified OOO issue queue
- 2 LSU, 4-cycle load-use
- 4 FXU, 2 SIMD/FP/BCD
- In-order completion & checkpoint

18 Physical constraints on the pipeline
[Figure; labels include RLM, LBS, L2, L3, L4, and Chiplet]

19 PD micro-architect allotment
[Figure; labels include RLM, LBS, L2, L3, L4, and Chiplet]

20 Sequential Buffering
[Figure; labels include RLM, LBS, L2, L3, L4, and Chiplet]

21 Conclusions
- Most innovation in microprocessors now comes from architecture, micro-architecture, and accelerators, and from physical design optimization at the micro-architectural level, in place of Moore's law technology progress and fixed block-level PPA optimization.
- This is leading to significantly more new Logic being designed and modified concurrently with the Physical Design.
- Concurrent design of Logic and PD leads to interesting new problems to be explored, with significantly larger potential pay-off due to the micro-architectural / PD co-optimization design space.
