Workloads, Scalability and QoS Considerations in CMP Platforms
1 Workloads, Scalability and QoS Considerations in CMP Platforms
Presenter: Don Newell, Sr. Principal Engineer, Intel Corporation
© 2007 Intel Corporation
2 Agenda
- Trends and research context
- Evolving Workload Scenarios
- Platform Scalability
- Platform QoS
CTG/STL/MPA Review w/ DAP 2
3 Trends - Cores, Threads, and Virtual Machines
[Figure: evolution from a single OS running a few apps today, to a few guest OSes (OS 1, OS 2) on a VMM with virtualized I/O tomorrow, to many virtual machines (VM1..VMn) per platform in the not-too-distant future.]
Lots of hardware threads and greater software diversity will challenge cache, memory and I/O.
4 Trends - Discussion Points
- Goal: scaling up to 100s of logical processors on a single CPU die
- Scaling hardware threads means provisioning and re-architecting other platform resources
- Scaling hardware threads means learning to share: QoS revisited
5 Trends - Pool of Virtual Resources
[Figure: front-end, mid-tier and back-end application containers, each running workloads on a guest OS inside a virtual resource container; a Virtual Machine Monitor creates, deletes and manages the dynamic application containers on top of processor-supported virtualization over many-core processor resources (CPU1..CPU3).]
6 Trends - Decomposed OS
[Figure: containers (e.g. OLTP) hosting application threads on micro OS kernels, with OS kernel services, OS service threads and device drivers decomposed into separate containers communicating over a messaging interface; a legacy OS runs alongside, all on a hypervisor over many-core hardware resources (CPU1..CPU3).]
7 Platform Scaling
8 Tera-Scale Workload Scenarios
- Highly multi-threaded server workloads running on an SMP OS on a tera-scale platform: OLTP, J2EE app servers, e-commerce, ERP, search, etc. (TPC-C, SPECjbb2005, SAP, SPECjappserver2004, etc.)
- Virtualized server environments (workload consolidation, datacenter-on-chip scenarios): multiple SMP OS instances on a VMM on a tera-scale platform
Considerations: performance => overall throughput; scalability => headroom; quality of service => performance isolation.
Let's begin with the single server app, and follow up with the consolidated scenario.
9 Tera-Scale Cache Hierarchy Design
- Small L1 caches shared between multiple threads on a core (C0-C3)
- Moderate mid-level L2 cache shared between cores within a node (Node 0..N-1)
- Distributed L3 (LLC 0..LLC N-1) shared by all nodes within the socket, over an on-die interconnect
Hierarchical sharing benefits: reduces replication, increases effective size, better localized communication, reduced interconnect pressure.
[Figure: MLC performance from 128K to 2M for 1 hardware thread (private) vs. 4 hardware threads (shared), for data and code misses per instruction; a shared MLC is equivalent to 2X the capacity of private MLCs.]
Hierarchical sharing increases caching effectiveness significantly.
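The replication-reduction claim above can be put in numbers with a toy model: with private mid-level caches every slice holds its own copy of data shared between threads, while a shared MLC stores it once. All parameters below (2 MB of total MLC, 4 threads, a 384 KB shared working set) are illustrative assumptions, not figures from the slide.

```python
def unique_data_capacity(total_kb, threads, shared_kb, shared_cache):
    """KB of *distinct* data the MLC level can hold.

    shared_kb: working set common to all threads. Private slices each
    replicate it; a shared cache keeps a single copy.
    """
    if shared_cache:
        return total_kb
    # (threads - 1) redundant copies of the shared data waste capacity.
    return total_kb - (threads - 1) * shared_kb

# 2 MB of MLC, 4 threads, 384 KB of shared data:
private = unique_data_capacity(2048, 4, 384, shared_cache=False)  # 896 KB
shared = unique_data_capacity(2048, 4, 384, shared_cache=True)    # 2048 KB
```

Under these assumptions sharing more than doubles the distinct data held (2048 KB vs. 896 KB), in line with the slide's observation that a shared MLC behaves like roughly 2X the equivalent private capacity.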
10 OLTP Tera-Scale Case Study
- Single socket, 32 cores / 4 threads, compared against multi-socket systems, assuming a perfect last-level cache
- 1 large core ~ 4 small cores; a small core delivers ~50% of large-core performance on server workloads
- ~2X performance potential, but:
- Memory bottleneck: need 100+ GB/s of memory bandwidth
- System interconnect bottleneck: need 40+ GB/s of bandwidth
Significant performance potential => scalability challenges.
11 Scalability Issues
- Tera-scale headroom requirements: start with 32 cores in 1st gen; maybe grow to 48 cores in next gen?
- How much memory and interconnect bandwidth will we need?
[Figure: sockets of CPU + LLC attached to memory over a system interconnect, with per-socket bandwidth requirements (GB/s) for 32 cores/socket and 48 cores/socket, and capacity estimates of ~40GB at 32C/skt vs. ~50-70GB at 48C/skt.]
Scaling issues just get worse.
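A back-of-envelope check of the 100+ GB/s memory-bandwidth figure: demand bandwidth is roughly misses per second times cache-line size. The per-thread instruction rate and miss rate below are hypothetical placeholders, not measured OLTP numbers.

```python
def dram_bandwidth_gbs(hw_threads, gips_per_thread, mpki, line_bytes=64):
    """Estimated DRAM demand bandwidth in GB/s.

    gips_per_thread: billions of instructions per second each hardware
    thread retires; mpki: last-level cache misses per 1000 instructions.
    """
    instr_per_sec = hw_threads * gips_per_thread * 1e9
    misses_per_sec = instr_per_sec * mpki / 1000.0
    return misses_per_sec * line_bytes / 1e9

# 32 cores x 4 threads, 0.5 GIPS/thread, 25 MPKI (all assumed):
bw = dram_bandwidth_gbs(32 * 4, 0.5, 25)  # ~102 GB/s
```

Even modest per-thread rates multiply out to triple-digit GB/s once 128 hardware threads run a cache-unfriendly server workload, which is why the slide flags memory bandwidth as the first wall.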
12 On-Socket DRAM Caches (For Scalability)
- Enable large-capacity L4 caches: low latency, high bandwidth
- Technologies: 3D stacking (processor stacked on a DRAM cache), multi-chip packages (MCP)
- Benefits: significant reduction in miss rate; avoids the bandwidth wall
[Figure: impact of a large DRAM cache on major server workloads -- miss rate (0-80%) vs. per-thread size (MB) for SPECjbb, TPC-C, SPECjAppServer and SAP.]
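A common way to reason about the miss-rate chart is a power-law fit (miss rate proportional to capacity^-alpha, with alpha near 0.5, the so-called square-root rule). The baseline rate and exponent here are assumptions chosen only to illustrate why multi-megabyte-per-thread DRAM caches cut miss rates so sharply; they are not fits to the slide's workloads.

```python
def miss_rate(capacity_mb, base_rate=0.70, base_mb=1.0, alpha=0.5):
    """Power-law cache model: miss rate shrinks as capacity^-alpha."""
    return base_rate * (base_mb / capacity_mb) ** alpha

# Growing a 1 MB/thread cache to a 64 MB/thread DRAM cache divides the
# modeled miss rate by sqrt(64) = 8.
ratio = miss_rate(1.0) / miss_rate(64.0)
```

Because the curve flattens, each doubling of capacity helps less, which is why jumping to DRAM-class capacities (rather than incrementally larger SRAM) is what moves the needle.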
13 QoS and Performance Management
14 Background for QoS Discussion
Trends:
- Increasing core count for performance
- Increasing workload diversity: multi-workload scenarios in the client, virtualization and consolidation in the server, heterogeneous architectures for graphics
Observations:
- Multi-core enables simultaneous execution of multiple workloads
- But not all workloads are equal -- users do have preferences
- How well does the user-preferred application run? Should platforms optimize for the user-preferred application?
15 Resource Management
- Capitalist: no management of resources; if you can generate more requests, you will use more resources; grab as you will. E.g. all of today's policies.
- Elitist: focus on individual efficiency; provide more performance and resources to the VIP, limited resources to non-VIPs. E.g. service level agreements, foreground/background.
- Communist/Fair: fair distribution of resources; give an equal share to all executing threads, which does not necessarily guarantee equal performance. E.g. partitioning resources for fairness and isolation.
- Utilitarian: focus on overall efficiency; give more resource to those that need it the most, less to others. E.g. cache-friendly vs. unfriendly applications, resource-aware scheduling.
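The first three philosophies can be sketched as cache-way allocators. The demands and priorities are illustrative inputs; a utilitarian allocator would additionally need per-application marginal-utility curves, so it is omitted from this sketch.

```python
def allocate(total_ways, demand, priority=None, policy="communist"):
    """Divide cache ways among apps. demand: app -> requested ways;
    priority (elitist only): app -> rank, lower = more important."""
    apps = sorted(demand)
    if policy == "capitalist":
        # Grab as you will: shares simply track offered demand.
        total = sum(demand.values())
        return {a: total_ways * demand[a] // total for a in apps}
    if policy == "communist":
        # Equal shares regardless of demand; remainder to the first apps.
        base, extra = divmod(total_ways, len(apps))
        return {a: base + (1 if i < extra else 0) for i, a in enumerate(apps)}
    if policy == "elitist":
        # Serve VIPs in rank order until the cache runs out.
        left, out = total_ways, {}
        for a in sorted(apps, key=lambda a: priority[a]):
            out[a] = min(demand[a], left)
            left -= out[a]
        return out
    raise ValueError(f"unknown policy: {policy}")
```

The same 16 ways split very differently under each policy: a heavy streamer gets 12 ways under the capitalist policy but only its fair 8 under the communist one, which is exactly the tension the slide is naming.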
16 The Multi-Workload Problem
A foreground office application runs alongside a background image-recognition application, sharing cores, cache and memory (Conroe Core 2 Duo measurements).
[Figure: execution time of the foreground application running alone vs. running with the background application -- a significant response-time slowdown of ~50%.]
- The platform does not distinguish between applications in resource allocation
- The preferred (foreground) application can suffer significant slowdown
- Platform QoS can improve user-preferred application performance
17 Contention - Client-Side Examples (Performance Differentiation)
[Figure: runtime slowdown (1x-5x) for SPEC pairs run together, measured on a dual-core Conroe (4M cache, DDR2 667): ~30% of pairs exhibit 20%-5X slowdown, ~20% of pairs exhibit 10-20% slowdown, and the rest exhibit <10% slowdown. Benchmarks include mcf, art, equake, bzip2, wupwise, swim, eon, ammp, applu, apsi, crafty, facerec, fma3d, galgel, gap, gcc, gzip, lucas, mesa, mgrid, parser, perlbmk, sixtrack, twolf, vortex and vpr.]
Similar data has been collected for server applications.
Resource contention will impair the performance of important apps.
18 Application Behavior & Overall Performance (Performance Management)
With many heterogeneous applications, sharing behavior varies:
- Destructive (streaming) threads cause interference
- Normal (typical) threads
- Constructive (co-operating) threads share data
- Neutral (little data) threads
Monitoring resource usage and grouping/partitioning accordingly offers the potential to improve overall performance.
[Figure: example benefits -- performance loss in response time (0-300%) for Windows scheduling, which has no understanding of resource usage behavior, vs. resource-aware scheduling, in Clovertown (8 cores / 8 apps) experiments mixing destructive, normal and neutral apps: swim, eon, gcc, mcf, twolf, gzip, equake, gap.]
Managing resource contention can improve overall throughput too.
19 Service Level Agreements in the Enterprise (Need for Performance Isolation and SLA Enforcement)
Server consolidation places many different virtual machines -- e.g. an OLTP VM, a middle-tier VM and network apps (App1, App2, App3) -- on shared cores under a utility SLA, with SLA management assigning priorities (App1 Hi, App2 Med, App3 Lo).
- Disparate resource usage can cause performance isolation concerns
- Disparate resource usage and contention hurt SLAs
- Prioritized resource allocation can help address performance isolation and SLA enforcement
20 Shared Platform Resources
High-priority and low-priority workloads (apps/VMs on an OS/VMM) contend for:
- OS-visible (managed) resources: hardware threads (T1-T4) on cores
- OS-invisible (unmanaged) resources: shared microarchitectural resources, shared cache space, memory bandwidth and latency, and coarse-grain platform power management
21 Problem Summary
- CMP: many heterogeneous threads, apps and VMs
- Not all applications are equal; users have preferences: end-users (client) want to treat the foreground preferentially, and end-users (server) want service-level differentiation (SLA) or isolation
- Applications use resources differently (destructive vs. constructive vs. neutral threads), with no performance management to protect from bad behavior
- Priority-based OS scheduling is no longer sufficient: with more cores, the OS will allow high- and low-priority applications to run simultaneously and contend for resources, so low-priority applications will steal platform resources from high-priority apps -- a loss in performance and user experience
- The platform has no support for application differentiation: no knowledge of preferences or resource usage, and no support for fine-grained tracking of the many shared resources that have significant performance value
22 Platform QoS
The software domain supplies QoS hints via an architectural interface; the hardware domain applies HW policies through resource monitoring, QoS exposure and resource enforcement, feeding results back to software.
QoS-enabled resources: CPU core (uarch resource usage), cache space, memory bandwidth and latency, I/O response time, power (voltage/frequency).
Goals: preferential treatment of the VIP, better overall throughput.
23 Visible QoS Spectrum (Cache/Memory)
- No QoS: no monitoring, no enforcement across apps 1-3; no determinism, but no complexity; ~good resource usage due to the greedy approach, okay throughput.
- Classes of Service: monitoring per app, enforcement per class (Class 1, Class 2); incremental complexity that bridges the gap; an extensible architecture satisfying the requirements for performance management and performance differentiation.
- Guarantee per App: full monitoring and full enforcement per app (e.g. 2MB, 6GB/s) via per-app partitions; significant complexity, ~lower resource usage due to per-VM guarantees, ~poorer throughput.
24 QoS-Aware Architecture - Cache
- Set the application's platform priority (high vs. low): a QoS-aware OS/VMM adds the platform priority to the application state
- Platform QoS exposure: the priority is exposed through a Platform QoS Register (PQR), and requests from each core (Core 0, Core 1) are tagged with it
- Resource monitoring: monitor cache utilization per application
- Resource enforcement: enforce cache utilization per priority level (Hi/Lo)
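The tag-and-enforce flow on this slide can be mimicked in a few lines: each request carries its class of service (as a PQR would supply), and the cache restricts each class to a subset of its ways, so a low-priority streamer cannot evict high-priority data. The class names and way masks below are hypothetical, not a real hardware interface.

```python
class QoSCache:
    """Toy cache with per-class way masks enforcing cache QoS."""

    def __init__(self, way_mask_per_class):
        self.mask = way_mask_per_class  # class of service -> way indices
        self.lines = {}                 # way index -> resident tag

    def insert(self, tag, clos):
        allowed = list(self.mask[clos])
        for w in allowed:               # prefer an empty allowed way
            if w not in self.lines:
                self.lines[w] = tag
                return w
        victim = allowed[0]             # else evict only within own ways
        self.lines[victim] = tag
        return victim

# Hi may fill all 8 ways; Lo is confined to ways 6-7.
cache = QoSCache({"hi": range(8), "lo": range(6, 8)})
```

However many low-priority lines stream through, they only ever recycle ways 6 and 7, which is the performance-isolation effect the enforcement stage is meant to provide.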
25 Cache/Memory QoS Benefits - Client SPEC Case Study
[Figure: normalized response time (0-6x, lower is better) for art (Hi) + swim (Lo) and mcf (Hi) + swim (Lo) under: unmanaged sharing; memory QoS (bandwidth reservation for Hi); cache QoS (Lo restricted to 20% of the cache); cache + memory QoS; and dedicated resources (best case).]
Based on measurements, simulations and analytical projections: significant benefits from cache/memory QoS.
26 QoS-Aware Architecture: Power
- Power management knobs: voltage/frequency scaling, issue restriction, gating
- QoS monitoring spans cache, memory and power, with application priorities carried in the Platform QoS Register, so high-priority work runs on a fast core and low-priority work on a slow core
- Resource enforcement #1: victimize low priority for power QoS
- Resource enforcement #2: use power throttling to enforce performance QoS
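Resource enforcement #2 amounts to a feedback loop: step the low-priority core's frequency down until the high-priority application's measured slowdown meets a target. The 1.00, 0.89 and 0.39 steps match the frequency fractions quoted on the next slide; the intermediate steps, the target, and the contention model are illustrative assumptions.

```python
# Frequency fractions; 1.00, 0.89 and 0.39 come from the Power QoS
# Benefits data, the rest are assumed intermediate steps.
FREQ_STEPS = [1.00, 0.89, 0.75, 0.50, 0.39]

def throttle_low_priority(hi_slowdown_at, target=1.10):
    """Step the low-priority core's frequency down until the measured
    high-priority slowdown meets the target, then stay there."""
    for f in FREQ_STEPS:
        if hi_slowdown_at(f) <= target:
            return f
    return FREQ_STEPS[-1]  # floor: fully throttled is the best we can do

# Placeholder contention model: high-priority slowdown shrinks as the
# low-priority core slows (real hardware would read perf counters).
model = lambda f: 1.0 + 0.5 * f
```

A loose target leaves the low-priority core running faster; a tight target drives it to the floor, trading background throughput for foreground responsiveness exactly as the slide describes.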
27 Power QoS Benefits
CPU clock throttling of the low-priority core, with swim as the low-priority application:
[Figure: user-preferred application slowdown (1x-5x) across the same SPEC benchmarks as the contention study (mcf, art, equake, vpr, ammp, facerec, bzip2, lucas, vortex, galgel, wupwise, applu, parser, mgrid, twolf, gcc, fma3d, gap, perlbmk, crafty, apsi, mesa, sixtrack, gzip, eon) with the low-priority core at 100%, 89% and 39% of maximum frequency.]
- Maximum frequency (100%) for low priority => maximum slowdown of the high-priority application
- Reduced frequency (89%) for low priority => reduced slowdown for high priority
- Low frequency (39%) for low priority => minimal slowdown for high priority
Power QoS improves the performance of the user-preferred application.
28 Virtualization: From VMs to VPAs (Managing Transparent Resources)
- A Virtual Machine Monitor manages CPU and memory capacity: workloads on guest OSes run in virtual resource containers with CPU and memory capacity allocations
- A Virtual Platform Architecture additionally manages performance with platform QoS (PQoS): cache/memory allocation (capacity and bandwidth) and power management (voltage/frequency, instruction and issue throttling, average power)
Performance isolation and SLA differentiation motivate Virtual Platform Architectures.
29 Summary
- Large-scale CMP is going to happen; lots of work remains to identify and remove the platform and architectural limitations preventing applications and execution environments from scaling up to 100s of logical processors on a single CPU die
- Scalability concerns can be addressed: a hierarchy of shared caches, large DRAM caches
- QoS concerns can be addressed: dynamic cache allocation (cache QoS), dynamic power management (power QoS)
- Smart performance management requires that more visibility be available to the execution environment: resource utilization counters for schedulers, etc.
More informationPerformance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware
Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware 2010 VMware Inc. All rights reserved About the Speaker Hemant Gaidhani Senior Technical
More informationWhat s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1
What s New in VMware vsphere 4.1 Performance VMware vsphere 4.1 T E C H N I C A L W H I T E P A P E R Table of Contents Scalability enhancements....................................................................
More informationIntel Enterprise Processors Technology
Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology
More informationDual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window
Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era
More informationVIRTUALIZING SERVER CONNECTIVITY IN THE CLOUD
VIRTUALIZING SERVER CONNECTIVITY IN THE CLOUD Truls Myklebust Director, Product Management Brocade Communications 2011 Brocade Communciations - All Rights Reserved 13 October 2011 THE ENTERPRISE IS GOING
More informationLecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)
Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling
More informationBalanced Cache: Reducing Conflict Misses of Direct-Mapped Caches through Programmable Decoders
Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches through Programmable Decoders Chuanjun Zhang Department of Computer Science and Electrical Engineering University of Missouri-Kansas City
More informationInstruction Based Memory Distance Analysis and its Application to Optimization
Instruction Based Memory Distance Analysis and its Application to Optimization Changpeng Fang cfang@mtu.edu Steve Carr carr@mtu.edu Soner Önder soner@mtu.edu Department of Computer Science Michigan Technological
More informationAre we ready for high-mlp?
Are we ready for high-mlp?, James Tuck, Josep Torrellas Why MLP? overlapping long latency misses is a very effective way of tolerating memory latency hide latency at the cost of bandwidth 2 Motivation
More informationAn Approach for Adaptive DRAM Temperature and Power Management
An Approach for Adaptive DRAM Temperature and Power Management Song Liu, Seda Ogrenci Memik, Yu Zhang, and Gokhan Memik Department of Electrical Engineering and Computer Science Northwestern University,
More informationPerformance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference
The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee
More informationExploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer
More informationEvolution of Rack Scale Architecture Storage
Evolution of Rack Scale Architecture Storage Murugasamy (Sammy) Nachimuthu, Principal Engineer Mohan J Kumar, Fellow Intel Corporation August 2016 1 Agenda Introduction to Intel Rack Scale Design Storage
More informationVirtual Private Caches
Kyle J. Nesbit, James Laudon *, and James E. Smith University of Wisconsin Madison Departmnet of Electrical and Computer Engr. { nesbit, jes }@ece.wisc.edu Sun Microsystems, Inc. * James.laudon@sun.com
More informationThe Software Driven Datacenter
The Software Driven Datacenter Three Major Trends are Driving the Evolution of the Datacenter Hardware Costs Innovation in CPU and Memory. 10000 10 µm CPU process technologies $100 DRAM $/GB 1000 1 µm
More informationUnderstanding Bulldozer architecture through Linpack benchmark
Understanding Bulldozer architecture through Linpack benchmark By Joshua.Mora@amd.com Abstract: AMD has recently introduced a new core architecture named Bulldozer in the newly released multicore processors
More information1.6 Computer Performance
1.6 Computer Performance Performance How do we measure performance? Define Metrics Benchmarking Choose programs to evaluate performance Performance summary Fallacies and Pitfalls How to avoid getting fooled
More informationCluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures
Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah ABSTRACT The growing dominance of wire delays at future technology
More informationPerformance, Cost and Amdahl s s Law. Arquitectura de Computadoras
Performance, Cost and Amdahl s s Law Arquitectura de Computadoras Arturo Díaz D PérezP Centro de Investigación n y de Estudios Avanzados del IPN adiaz@cinvestav.mx Arquitectura de Computadoras Performance-
More informationWaveScalar. Winter 2006 CSE WaveScalar 1
WaveScalar Dataflow machine good at exploiting ILP dataflow parallelism traditional coarser-grain parallelism cheap thread management memory ordering enforced through wave-ordered memory Winter 2006 CSE
More informationCSC501 Operating Systems Principles. OS Structure
CSC501 Operating Systems Principles OS Structure 1 Announcements q TA s office hour has changed Q Thursday 1:30pm 3:00pm, MRC-409C Q Or email: awang@ncsu.edu q From department: No audit allowed 2 Last
More informationPerformance Oriented Prefetching Enhancements Using Commit Stalls
Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationSpeculative Multithreaded Processors
Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads
More informationExploring Wakeup-Free Instruction Scheduling
Exploring Wakeup-Free Instruction Scheduling Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin Microsystems Design Lab The Pennsylvania State University Outline Motivation Case study: Cyclone Towards high-performance
More informationCache Insertion Policies to Reduce Bus Traffic and Cache Conflicts
Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts Yoav Etsion Dror G. Feitelson School of Computer Science and Engineering The Hebrew University of Jerusalem 14 Jerusalem, Israel Abstract
More informationSWAP: EFFECTIVE FINE-GRAIN MANAGEMENT
: EFFECTIVE FINE-GRAIN MANAGEMENT OF SHARED LAST-LEVEL CACHES WITH MINIMUM HARDWARE SUPPORT Xiaodong Wang, Shuang Chen, Jeff Setter, and José F. Martínez Computer Systems Lab Cornell University Page 1
More informationResource Containers. A new facility for resource management in server systems. Presented by Uday Ananth. G. Banga, P. Druschel, J. C.
Resource Containers A new facility for resource management in server systems G. Banga, P. Druschel, J. C. Mogul OSDI 1999 Presented by Uday Ananth Lessons in history.. Web servers have become predominantly
More informationAB-Aware: Application Behavior Aware Management of Shared Last Level Caches
AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering
More informationibench: Quantifying Interference in Datacenter Applications
ibench: Quantifying Interference in Datacenter Applications Christina Delimitrou and Christos Kozyrakis Stanford University IISWC September 23 th 2013 Executive Summary Problem: Increasing utilization
More informationResource-Conscious Scheduling for Energy Efficiency on Multicore Processors
Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe
More informationfor High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami
3D Implemented dsram/dram HbidC Hybrid Cache Architecture t for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami Kyushu University
More informationmos: An Architecture for Extreme Scale Operating Systems
mos: An Architecture for Extreme Scale Operating Systems Robert W. Wisniewski, Todd Inglett, Pardo Keppel, Ravi Murty, Rolf Riesen Presented by: Robert W. Wisniewski Chief Software Architect Extreme Scale
More informationTowards Energy-Proportional Datacenter Memory with Mobile DRAM
Towards Energy-Proportional Datacenter Memory with Mobile DRAM Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University
More information