M7: Next Generation SPARC. Hot Chips 26, August 12, 2014. Stephen Phillips, Senior Director, SPARC Architecture, Oracle

Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

SPARC @ Oracle: 5 Processors in 4 Years
- SPARC T3 (2010): 16 S2 cores, 4MB, 40nm technology, 1.65 GHz
- SPARC T4 (2011): 8 S3 cores, 4MB, 40nm technology, 3.0 GHz
- SPARC T5 (2012): 16 S3 cores, 8MB, 28nm technology, 3.6 GHz
- SPARC M5 (2012): 6 S3 cores, 48MB L3$, 28nm technology, 3.6 GHz
- SPARC M6 (2013): 12 S3 cores, 48MB, 28nm technology, 3.6 GHz

Oracle's SPARC Strategy: Extreme Performance, Computing Efficiency, Optimized for Oracle Software
- Best Performance and Price-Performance in the Industry: Deliver Services with Faster Speeds, and Lower the Costs for Existing Services
- Flexible Virtualization, Designed for Availability: Enable Higher Utilization of Capital Assets, and Reduce Risks to Availability of Data
- Hardware and Software Engineered, Tested, and Supported Together: Deploy Technology Faster, with Less Risk and Lower Costs
http://blog.oracle.com/bestperf

M7 Processor: Extreme Performance
[Die diagram: core clusters surrounded by accelerators, memory control, coherence, SMP & I/O interconnect, and the on-chip network]
- 32 SPARC Cores
  - Fourth Generation CMT Core (S4)
  - Dynamically Threaded, 1 to 8 Threads Per Core
- New Cache Organizations
  - Shared Level 2 Data and Instruction Caches
  - 64MB Shared & Partitioned Level 3 Cache
- DDR4 DRAM
  - Up to 2TB Physical Memory per Processor
  - 2X-3X Memory Bandwidth over Prior Generations
- PCIe Gen3 Support
- Application Acceleration
  - Real-time Application Data Integrity
  - Concurrent Memory Migration and VA Masking
  - DB Query Offload Engines
- SMP Scalability from 1 to 32 Processors
- Coherent Memory Clusters
- Technology: 20nm, 13ML

M7 Core (S4): Extreme Performance
- Dynamically Threaded, 1 to 8 Threads
- Increased Frequency at Same Pipeline Depths
- Dual-Issue, OOO Execution Core
  - 2 ALUs, 1 LSU, 1 FGU, 1 BRU, 1 SPU
  - 40 Entry Pick Queue
  - 64 Entry FA I-TLB, 128 Entry FA D-TLB
- Cryptographic Performance Improvements
- 54-bit VA, 50-bit RA/PA, 4X Addressing Increase
- Fine-grain Power Estimator
- Live Migration Performance Improvements
- Core Recovery
- Application Acceleration Support
  - Application Data Integrity
  - Virtual Address Masking
  - User-Level Synchronization Instructions

M7 Core Cluster
- 4 SPARC Cores Per Cluster
- New L2$ Architecture
  - 1.5X Larger at Same Cycle Count Latency as Prior Generations
  - 2X Bandwidth per Core: >1TB/s Per Core Cluster, >8TB/s per Chip
- 256KB L2-I$ Shared by All Cores in Core Cluster
  - 4-way SA, 64B Lines, >500GB/s Throughput
  - 4 Independent Core Interfaces @ >128GB/s Each
- 256KB Writeback L2-D$, Each Shared by 2 Cores (Core-pair)
  - 8-way SA, 64B Lines, >500GB/s Throughput per L2-D$
  - 2 Independent Core Interfaces @ >128GB/s per L2-D$ Core-pair
- Dynamic Performance Optimizations
  - Doubles Core Execution Bandwidth at Prior Generation Thread Count to Maximize Per-Thread Performance
  - Utilize 8 Threads Per Core to Maximize Core Throughput
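The per-cluster and per-chip figures above are mutually consistent. A quick back-of-the-envelope check, treating the slide's ">" numbers as lower bounds and counting 8 core interfaces per cluster (4 into the L2-I$ plus 2 into each of the two L2-D$s) as one plausible accounting:

```python
# Sanity-check the slide's cache bandwidth figures (lower bounds, in GB/s).
per_core_interface = 128       # each core interface into an L2$: >128 GB/s
interfaces_per_cluster = 8     # 4 (L2-I$) + 2 x 2 (two L2-D$s) -- assumption
clusters_per_chip = 8          # 32 cores / 4 cores per cluster

per_cluster = per_core_interface * interfaces_per_cluster
per_chip = per_cluster * clusters_per_chip

print(per_cluster, per_chip)   # 1024 GB/s (>1 TB/s), 8192 GB/s (>8 TB/s)
```

This reproduces the ">1TB/s per core cluster" and ">8TB/s per chip" claims, though the exact interface accounting is an assumption, not stated on the slide.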

M7 Level 3 Cache and On-Chip Network
[Diagram: OCN data bandwidths in GB/s between the L3 partitions, memory & accelerators, and SMP & I/O interfaces]
- 64MB Shared & Partitioned
  - 8MB Local Partitions, Each 8-way SA
  - >25% Reduction in Local Cycle Latency Compared to Prior Generation (M6)
  - >1.6TB/s Bandwidth per Chip, 2.5X (T5) to 5X (M6) Over Previous Generations
  - Cache Lines May be Replicated, Migrated or Victimized Between Partitions
  - HW Accelerators May Directly Allocate into Target Partitions
- On-Chip Network (OCN)
  - Consists of Request (Rings), Response (Pt-Pt) and Data (Mesh) Interconnects
  - 0.5TB/s Bisection Data Bandwidth

M7 Process Group & Virtualization Affinity
- Solaris Process Group Hierarchy
  - Thread Scheduling Level Corresponding to Core Cluster and Local Partition
  - Thread Load Balance Across Partitions
  - Thread Reschedule Affinity
  - Co-location of Threads Sharing Data
- Oracle Virtualization Manager (OVM)
  - Process Group Provisioning Applied to Logical Partitions
  - Minimize Thrashing Between Virtual Machine Instances, Improve QoS

M7 Fine-grain Power Management
- On-die Power Estimator Per Core
  - Generates Dynamic Power Estimates By Tracking Internal Core Activities
  - Estimates Updated at 250 Nanosecond Intervals
- On-die Power Controller
  - Estimates Total Power of Cores and Caches on a Quadrant Basis (2 Core Clusters + 2 Partitions)
  - Accurate to within a Few Percent of Measured Power
  - Individually Adjusts Voltage and/or Frequency within Each Quadrant Based on Software Defined Policies
- Performance @ Power Optimizations
  - Highly Responsive to Workload Temporal Dynamics
  - Can Account for Workload Non-uniformity Between Quadrants
  - Quadrants May be Individually Power Gated
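The estimator/controller split above can be sketched in software. This is an illustrative toy model, not Oracle's implementation: the event weights, power budget, and frequency step are made-up example values, and the real M7 logic is in hardware.

```python
# Toy model of activity-based power estimation plus per-quadrant frequency
# adjustment. All constants are hypothetical illustration values.
WEIGHTS = {"alu_ops": 2.0, "fgu_ops": 4.0, "l2_accesses": 3.0}  # pJ/event (made up)

def estimate_power(activity_counts, interval_ns=250):
    """Dynamic power (W) from per-interval activity counters: energy / time."""
    energy_pj = sum(WEIGHTS[e] * n for e, n in activity_counts.items())
    return energy_pj * 1e-12 / (interval_ns * 1e-9)

def adjust_quadrant(freq_ghz, power_w, budget_w, step=0.1):
    """Step frequency down when over budget, up when there is headroom."""
    if power_w > budget_w:
        return max(freq_ghz - step, 1.0)
    return min(freq_ghz + step, 4.0)

# One 250 ns interval of (hypothetical) counter readings for a quadrant:
activity = {"alu_ops": 400_000, "fgu_ops": 100_000, "l2_accesses": 50_000}
p = estimate_power(activity)        # 5.4 W estimated
f = adjust_quadrant(3.6, p, budget_w=5.0)   # over budget -> step down to 3.5
```

The design point worth noting from the slide is the 250 ns update interval: fast enough to track workload temporal dynamics, which an OS-driven governor (milliseconds) cannot do.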

M7 Memory and I/O
- 4 DDR4 Memory Controllers
  - Memory Links to Buffer Chips, 12.8Gbps/14.4Gbps/16Gbps Link Rates
  - Lane Failover with Full CRC Protection
  - 16 DDR4-2133/2400/2667 Channels
  - Very Large Memory, Up to 2TB per Processor
  - 160GB/s (DDR4-2133) Measured Memory Bandwidth (2X to 3X Previous Generations, T5 and M6)
  - DIMM Retirement Without System Stoppage
- Speculative Memory Read
  - Reduces Local Memory Latency by Prefetching on Local Partition Miss
  - Dynamic per Request, Based on History (Data, Instruction) and Threshold Settings
- PCIe Gen3
  - 4 Internal Links Supporting >75GB/s
  - >2X Previous Generations, T5 and M6
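As a rough cross-check on the 160GB/s measured figure, the theoretical peak can be computed, assuming standard 64-bit (8-byte) DDR4 channels behind the buffer chips; the slide does not state the channel width, so that is an assumption:

```python
# Peak-bandwidth arithmetic for 16 channels of DDR4-2133,
# assuming 64-bit channels (an assumption, not stated on the slide).
mt_per_s = 2133e6          # DDR4-2133 transfer rate
bytes_per_transfer = 8     # 64-bit channel width
channels = 16

peak_gb_s = mt_per_s * bytes_per_transfer * channels / 1e9   # ~273 GB/s peak
efficiency = 160 / peak_gb_s                                  # ~59% of peak measured
```

Under that assumption, the quoted 160GB/s measured bandwidth corresponds to roughly 60% of theoretical peak, a plausible sustained fraction for mixed read/write traffic.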

M7 Processor Performance
[Chart: relative performance versus an M6 baseline of 1, on a 0 to 4 scale, for Memory BW, Int Throughput, OLTP, Java, ERP, and FP Throughput]

M7 Application Data Integrity
[Diagram: version metadata associated with each 64-byte block of memory data in the M7 memory and caches, checked against reference versions by load/store operations in the M7 core pipeline; a mismatch raises a version miscompare]
- Safeguards Against Invalid/Stale References and Buffer Overruns for Solaris and DB Clients
- Real-time Data Integrity Checking in Test & Production Environments
- Version Metadata Associated with 64-Byte Aligned Memory Data
  - Metadata Stored in Memory, Maintained Throughout the Cache Hierarchy and All Interconnects
  - Memory Version Metadata Checked Against Reference Version by Core Load/Store Units
- HW Implementation, Very Low Overhead
- Enables Applications to Inspect Faulting References, Diagnose and Take Appropriate Recovery Actions
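The check described above can be modeled in a few lines. This is a toy Python simulation of the idea, not SPARC hardware behavior; the 4-bit tag in the pointer's top bits is an illustrative layout choice:

```python
# Toy ADI model: every 64-byte granule carries a version tag; a load
# compares the tag carried in the pointer against the granule's tag.
GRANULE = 64
memory_versions = {}   # granule index -> version tag

def adi_alloc(addr, size, version):
    """Tag every 64-byte granule covering [addr, addr+size) with `version`."""
    for g in range(addr // GRANULE, (addr + size + GRANULE - 1) // GRANULE):
        memory_versions[g] = version

def adi_load(tagged_ptr):
    """Split pointer into (version, address) and check the granule's tag."""
    version = tagged_ptr >> 60
    addr = tagged_ptr & ((1 << 60) - 1)
    if memory_versions.get(addr // GRANULE) != version:
        raise MemoryError("ADI version mismatch: stale or invalid reference")
    return addr

adi_alloc(0x1000, 128, version=5)
ptr = (5 << 60) | 0x1000
assert adi_load(ptr) == 0x1000      # versions match: access allowed
adi_alloc(0x1000, 128, version=6)   # memory freed and re-tagged by allocator
# adi_load(ptr) would now raise MemoryError: the old pointer is stale
```

In the real design the check happens in the load/store units with near-zero overhead, which is what makes it usable in production, not just in test.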

M7 Concurrent Fine-grain Memory Migration
[Diagram: concurrent thread accesses to relocating objects in old memory regions and in the new memory region]
- Enables Concurrent and Continuous Operation
  - Hardware Support For Fine Grain Access Control
  - Applicable to Fine-grain Objects and Large Memory Pages
  - Bypasses Operating System Page Protection Overheads
  - Scales with Threads and Memory Bandwidth
- Deterministic Memory Access Conflict Resolution
  - Memory Metadata of Relocating Objects Are Marked for Migration
  - User Level Trap on Detection of Memory Reference with Migrating Version
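The conflict-resolution scheme above can be sketched as follows. This is an illustrative simulation under stated assumptions: the reserved "migrating" tag value and the forwarding table are hypothetical stand-ins for the hardware metadata and the user-level trap handler's bookkeeping:

```python
# Toy model of concurrent fine-grain migration: objects being relocated
# carry a reserved "migrating" version tag; an access that observes it
# takes a user-level handler that forwards to the new location.
MIGRATING = 0xF            # reserved tag value (an assumption for this sketch)

versions = {}              # address -> version tag
forwarding = {}            # old address -> new address (handler's bookkeeping)

def load(addr, expected_version):
    if versions[addr] == MIGRATING:
        # Stand-in for the user-level trap: chase the forwarding pointer.
        return load(forwarding[addr], expected_version)
    if versions[addr] != expected_version:
        raise MemoryError("version mismatch")
    return addr

versions[0x100] = 2            # live object in the old region
versions[0x900] = 2            # relocated copy in the new region
versions[0x100] = MIGRATING    # old copy marked for migration
forwarding[0x100] = 0x900
assert load(0x100, 2) == 0x900   # a concurrent access lands on the new copy
```

The point of doing this with memory-version metadata rather than page protection is granularity: individual objects can be relocated without unmapping whole (possibly very large) pages or taking kernel traps.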

M7 Virtual Address (VA) Masking
[Diagram: a 64-bit virtual address whose upper bits hold hardware memory metadata (ADI, migration) and embedded object metadata, masked off the effective address; the object virtual address pointer occupies the low 54, 48, 40 or 32 bits]
- Allow Programs to Embed Metadata in Upper Unused Bits of Virtual Address Pointers
  - Applications Using 64-bit Pointers Can Set Aside 8, 16, 24 or 32 Bits
  - Addressing Hardware Ignores Metadata
- Enables Managed Runtimes (e.g. JVMs) to Embed Metadata for Tracking Object Information
  - Caches Object State Table Information into Object Pointer (Pointer Coloring)
  - Eliminates De-reference and Memory Load from Critical Path
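A minimal model of pointer coloring under VA masking, assuming the 48-bit address width from the slide's options (the 16-bit metadata field and the helper names are illustrative, not a SPARC API):

```python
# Minimal pointer-coloring model: metadata lives in the upper bits of a
# 64-bit pointer; masked addressing hardware ignores it on dereference.
VA_BITS = 48                 # one of the slide's options (54/48/40/32)
VA_MASK = (1 << VA_BITS) - 1

def color(ptr, metadata):
    """Embed metadata (up to 16 bits at this width) above the address bits."""
    return (metadata << VA_BITS) | (ptr & VA_MASK)

def dereference_addr(ptr):
    """The address the masked addressing hardware actually uses."""
    return ptr & VA_MASK

def metadata_of(ptr):
    """Recover the color without any memory access."""
    return ptr >> VA_BITS

p = color(0x7F00_DEAD_0000, 0xAB)
assert dereference_addr(p) == 0x7F00_DEAD_0000   # hardware sees the real address
assert metadata_of(p) == 0xAB                    # runtime reads the color for free
```

The win for a managed runtime is the last line: object state that would otherwise need a dereference and a load of an object state table entry is recovered from the pointer itself, off the critical path.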

M7 Database In-Memory Query Accelerator
[Diagram: in-memory DB data in row format or compressed column format feeding the M7 in-silicon query engines; up to 32 concurrent DB streams in, up to 32 concurrent result streams out as bit/byte-packed, padded, indexed vectors]
- Hardware Accelerator Optimized for Oracle Database In-Memory
  - Task Level Accelerator that Operates on In-Memory Columnar Vectors
  - Operates on Decompressed and Compressed Columnar Formats
- Query Engine Functions
  - In-Memory Format Conversions
  - Value and Range Comparisons
  - Set Membership Lookups
- Fused Decompression + Query Functions Further Reduce Task Overhead, Core Processing Cycles and Memory Bandwidth per Query
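The three query-engine functions above can be illustrated with a software analogue. This is a simplified sketch, not the hardware's format: a bit-packed column of 3-bit values is unpacked, then filtered by a range comparison and a set-membership lookup, each producing a result bitvector:

```python
# Software analogue of the query engine functions: format conversion
# (unpack), range comparison, and set-membership lookup over a
# bit-packed columnar vector.
def unpack(packed, width, count):
    """Unpack `count` fixed-width values from a little-endian bit stream."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]

def range_filter(values, lo, hi):
    return [lo <= v <= hi for v in values]

def in_set(values, members):
    return [v in members for v in values]

# Build a bit-packed column of 3-bit values: [5, 1, 7, 3, 0, 6].
packed = 0
for i, v in enumerate([5, 1, 7, 3, 0, 6]):
    packed |= v << (i * 3)

col = unpack(packed, width=3, count=6)
assert col == [5, 1, 7, 3, 0, 6]
assert range_filter(col, 2, 6) == [True, False, False, True, False, True]
assert in_set(col, {0, 7}) == [False, False, True, False, True, False]
```

In hardware these steps are fused, so the predicate is evaluated as values stream out of decompression instead of materializing the unpacked column, which is what saves core cycles and memory bandwidth per query.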

M7 Query Accelerator Engine
[Diagram: one of the eight query engines, with local SRAM and data input/output queues on the on-chip network feeding four parallel pipelines of Decompress, Unpack/Alignment, Predicate Eval, and Result Format/Encode stages]
- Eight In-Silicon Offload Engines
- Cores/Threads Operate Synchronous or Asynchronous to Offload Engines
- User Level Synchronization Through Shared Memory
- High Performance at Low Power

M7 Accelerator Fine-grain Synchronization
[Diagram: timeline of a core thread preparing inputs, initiating a call to the HW accelerator, then reading results after the accelerator posts completion status]
- Core Thread Initiates a Query Plan Task to Offload Engine
- User-Level LDMONITOR, MWAIT
  - Halts Hardware Thread for Specified Duration
  - Thread is Re-activated Once Duration Expires or Monitored Memory Location is Updated
- Offload Engine Completion
  - Results Written Back to Memory or Target Partition
  - Completion Status Posted to Monitored Memory Location
  - MWAIT Detection Hardware Resumes Core Thread Execution
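The LDMONITOR/MWAIT pattern above can be approximated in portable code. This sketch uses a Python condition variable in place of the hardware monitor: the core thread sleeps until the accelerator posts a completion status to the monitored location or a timeout expires. The class and names are illustrative, not a SPARC API:

```python
# Portable approximation of monitor/mwait-style synchronization:
# a thread waits on one memory location until it is updated or a
# timeout expires, then resumes.
import threading

class MonitoredLocation:
    def __init__(self):
        self.value = 0
        self._cv = threading.Condition()

    def store(self, v):
        """Accelerator side: post completion status, waking any waiter."""
        with self._cv:
            self.value = v
            self._cv.notify_all()

    def mwait(self, timeout_s):
        """Core-thread side: halt until the location changes or time runs out."""
        with self._cv:
            self._cv.wait_for(lambda: self.value != 0, timeout=timeout_s)
            return self.value

status = MonitoredLocation()

def accelerator():                      # stand-in for an offload engine
    status.store(1)                     # results written, then status posted

threading.Thread(target=accelerator).start()
assert status.mwait(timeout_s=5.0) == 1   # core thread resumes on the update
```

The difference on M7 is that the wait is a user-level instruction pair with hardware wakeup, so synchronizing with an offload engine needs no system call and no busy-polling of the status word.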

M7 Query Offload Performance Example 11 10 9 8 7 6 5 4 3 2 1 0 Single Stream Decompression T5 Baseline (1 Thread) M7 (1 Query Pipeline) SPARC T5, M7 & Oracle Database In-Memory Single Stream Decompression Performance Decompression Stage of Query Acceleration Unaligned Bit-Packed Columnar Formats 1 of 32 Query Engine Pipelines M7 Hardware Accelerator Fuses Decompression Output with Filtering Operations Further Accelerates the WHERE Clause in Query In-line Predicate Evaluation Preserves Memory Bandwidth Business Analytics Performed at System Memory Bandwidth 19

SPARC Accelerated Program
- Enables Third Party Software to Utilize SPARC Application Acceleration Features
- Application Data Integrity Support in Future Solaris and DB
  - Memory Allocators, Compiler and Tools Chain
- Future Solaris Support for Memory Migration and Virtual Address (VA) Masking
- Query Engine Software Libraries

M7 SMP Scalability >1 TB/s 7 Fully Connected 8 Processor SMP Up to 256 Cores, 2K Threads, 16TB Memory >1TB/s Bisection Payload Bandwidth Fully Connected 2/4 Processor SMP Utilizing Link Trunking Directory-based Coherence 16Gbps to 18Gbps Link Rates Link Level Dynamic Congestion Avoidance Alternate Path Data Routing Based on Destination Queue Utilization Link Level RAS Auto Frame Retry Auto Link Retrain Single Lane Failover 21

M7 SMP Scalability
[Diagram: 32-processor switched system, 5.2TB/s]
- SMP Scalability to 32 Processors
  - Up to 1K Cores, 8K Threads
  - Up to 64TB Memory
- 64 Port Switch ASICs, Divided into 2 Switch Groups of 6 Switch ASICs
- 5.2TB/s Payload Bandwidth
  - 4X Bisection Payload Bandwidth over M6
- Physical Domains of 4 Processors Each
  - Fully Connected Local Topology
  - Dynamically Combine Processor Domains
- Fully Connected Switch Topology
  - 2 Links Connecting Every Processor and Switch
  - Coherence Directory Distributed Among Switches
  - Latency Reduction over M6 Generation
- SMP Interconnect RAS
  - Link Auto Retry & Retrain, Single Lane Failover
  - Link Level Multipathing
  - Operational with 5 of 6 ASICs per Switch Group

M7 Coherent Memory Clusters
[Diagram: three server nodes, each with cores/caches and memory divided into non-cache, remote and home segments; segments A, B and C are homed on one node and cached on the others]
- Highly Reliable and Secure Shared Memory Clustering Technology
  - Access Remote Node Memory Using Load/Store/Atomics, Relaxed Ordering
  - Integrated User Level Messaging and RDMA
- Cacheable Memory Segments
  - Remote Segments Cached in Local Node Memory & Caches at 64 Byte Granularity
  - Load-Hit Latency Accessing Remote Segments Same as Local Node Memory and Caches
  - Committed Stores to Remote Segments Update Home Node and Invalidate Remote Nodes
- Non-Cacheable Memory Segments Always Access Remote Node Memory
- Cluster-wide Security
  - 64-bit Access Key Per Remote Request
  - Memory Version Checking Across Cluster

M7 Coherent Memory Clusters
[Diagram: M7 server nodes attached to a cluster switch, 1.3TB/s]
- Up to 64 Processor Cluster
  - Combinations of 2P, 4P or 8P Server Nodes
  - Leverages M7 SMP HW and Interconnect
- Coherent Memory Cluster Protocol
  - Application Committed Stores Remain Consistent at the Home Node in Face of Requester Failure
  - 2-Party Dialogs for Failure Isolation
- 1.3TB/s Bisection Payload Bandwidth
- Self Redundant Cluster Switch
  - 64 Port Switch ASICs
  - 6 Switching Paths Between Each Processor Pair, Divided in 2 Groups
  - Fault Tolerant Design Allowing Operation with a Single Switching Path per Group
  - Automatic Link and Switch Failover Without Involving Application Software

Copyright 2014 Oracle and/or its affiliates. All rights reserved.

M7 Summary: Extreme Performance, Computing Efficiency, Optimized for Oracle Software
- Significant Increase in Processor Performance
  - Further Increase Core and Thread Performance
  - Increased Bandwidths Across Caches, Memory, Interconnects and I/O
  - Very Large Memory
- Increased Virtualization Density
  - Low Latency Application Migration
  - Flexible Logical and Physical Partitioning
  - Fine-grain Power Management
- Improved Security and Reliability via Real-time Application Data Integrity
- Concurrent Object Migration and Pointer Coloring
- Database In-Memory Columnar Decompression, Query Offload and Coherent Memory Clusters

Acronyms
- ADI: Application Data Integrity
- ALU: Arithmetic Logic Unit
- BRU: Branch Unit
- FA: Fully Associative
- FGU: Floating Point & Graphics Unit
- LSU: Load/Store Unit
- PA: Physical Address
- RA: Real Address
- SA: Set Associative
- SMP: Shared Memory Multiprocessor
- SPU: Stream Processing Unit
- TLB: Translation Lookaside Buffer (Instruction or Data)
- VA: Virtual Address