Workloads, Scalability and QoS Considerations in CMP Platforms
1 Workloads, Scalability and QoS Considerations in CMP Platforms
Presenter: Don Newell, Sr. Principal Engineer, Intel Corporation
© 2007 Intel Corporation
2 Agenda
- Trends and research context
- Evolving Workload Scenarios
- Platform Scalability
- Platform QoS
CTG/STL/MPA Review w/ DAP 2
3 Trends - Cores, Threads, and Virtual Machines
[Figure: evolution from a single OS running a few apps today, to a few guest OSes (OS 1, OS 2) on a VMM with virtualized I/O tomorrow, to many virtual machines (VM1..VMn) per platform in the not-too-distant future.]
Lots of hardware threads and greater software diversity will challenge cache, memory and I/O.
4 Trends - Discussion Points
- Goal: scaling up to 100s of logical processors on a single CPU die
- Scaling hardware threads means provisioning and re-architecting other platform resources
- Scaling hardware threads means learning to share: QoS revisited
5 Trends - Pool of Virtual Resources
[Figure: front-end, mid-tier and back-end application containers, each running workloads on a guest OS inside a virtual resource container; a Virtual Machine Monitor creates, deletes and manages the dynamic application containers on top of processor-supported virtualization over many-core processor resources (CPU1..CPU3).]
6 Trends - Decomposed OS
[Figure: containers (e.g. OLTP) hosting application threads on micro OS kernels, with OS kernel services, OS service threads and device drivers decomposed into separate containers communicating over a messaging interface; a legacy OS runs alongside, all on a hypervisor over many-core hardware resources (CPU1..CPU3).]
7 Platform Scaling
8 Tera-Scale Workload Scenarios
- Highly multi-threaded server workloads running on an SMP OS on a tera-scale platform: OLTP, J2EE app servers, e-commerce, ERP, search, etc. (TPC-C, SPECjbb2005, SAP, SPECjappserver2004, etc.)
- Virtualized server environments (workload consolidation, datacenter-on-chip scenarios): multiple SMP OS instances on a VMM on a tera-scale platform
Considerations: performance => overall throughput; scalability => headroom; quality of service => performance isolation.
Let's begin with the single server app, and follow up with the consolidated scenario.
9 Tera-Scale Cache Hierarchy Design
- Small L1 caches shared between multiple threads on a core (C0-C3)
- Moderate mid-level L2 cache shared between cores within a node (Node 0..N-1)
- Distributed L3 (LLC 0..LLC N-1) shared by all nodes within the socket, over an on-die interconnect
Hierarchical sharing benefits: reduces replication, increases effective size, better localized communication, reduced interconnect pressure.
[Figure: MLC performance from 128K to 2M for 1 hardware thread (private) vs. 4 hardware threads (shared), for data and code misses per instruction; a shared MLC is equivalent to 2X the capacity of private MLCs.]
Hierarchical sharing increases caching effectiveness significantly.
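The replication-reduction claim above can be put in numbers with a toy model: with private mid-level caches every slice holds its own copy of data shared between threads, while a shared MLC stores it once. All parameters below (2 MB of total MLC, 4 threads, a 384 KB shared working set) are illustrative assumptions, not figures from the slide.

```python
def unique_data_capacity(total_kb, threads, shared_kb, shared_cache):
    """KB of *distinct* data the MLC level can hold.

    shared_kb: working set common to all threads. Private slices each
    replicate it; a shared cache keeps a single copy.
    """
    if shared_cache:
        return total_kb
    # (threads - 1) redundant copies of the shared data waste capacity.
    return total_kb - (threads - 1) * shared_kb

# 2 MB of MLC, 4 threads, 384 KB of shared data:
private = unique_data_capacity(2048, 4, 384, shared_cache=False)  # 896 KB
shared = unique_data_capacity(2048, 4, 384, shared_cache=True)    # 2048 KB
```

Under these assumptions sharing more than doubles the distinct data held (2048 KB vs. 896 KB), in line with the slide's observation that a shared MLC behaves like roughly 2X the equivalent private capacity.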
10 OLTP Tera-Scale Case Study
- Single socket, 32 cores / 4 threads, compared against multi-socket systems, assuming a perfect last-level cache
- 1 large core ~ 4 small cores; a small core delivers ~50% of large-core performance on server workloads
- ~2X performance potential, but:
- Memory bottleneck: need 100+ GB/s of memory bandwidth
- System interconnect bottleneck: need 40+ GB/s of bandwidth
Significant performance potential => scalability challenges.
11 Scalability Issues
- Tera-scale headroom requirements: start with 32 cores in 1st gen; maybe grow to 48 cores in next gen?
- How much memory and interconnect bandwidth will we need?
[Figure: sockets of CPU + LLC attached to memory over a system interconnect, with per-socket bandwidth requirements (GB/s) for 32 cores/socket and 48 cores/socket, and capacity estimates of ~40GB at 32C/skt vs. ~50-70GB at 48C/skt.]
Scaling issues just get worse.
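A back-of-envelope check of the 100+ GB/s memory-bandwidth figure: demand bandwidth is roughly misses per second times cache-line size. The per-thread instruction rate and miss rate below are hypothetical placeholders, not measured OLTP numbers.

```python
def dram_bandwidth_gbs(hw_threads, gips_per_thread, mpki, line_bytes=64):
    """Estimated DRAM demand bandwidth in GB/s.

    gips_per_thread: billions of instructions per second each hardware
    thread retires; mpki: last-level cache misses per 1000 instructions.
    """
    instr_per_sec = hw_threads * gips_per_thread * 1e9
    misses_per_sec = instr_per_sec * mpki / 1000.0
    return misses_per_sec * line_bytes / 1e9

# 32 cores x 4 threads, 0.5 GIPS/thread, 25 MPKI (all assumed):
bw = dram_bandwidth_gbs(32 * 4, 0.5, 25)  # ~102 GB/s
```

Even modest per-thread rates multiply out to triple-digit GB/s once 128 hardware threads run a cache-unfriendly server workload, which is why the slide flags memory bandwidth as the first wall.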
12 On-Socket DRAM Caches (For Scalability)
- Enable large-capacity L4 caches: low latency, high bandwidth
- Technologies: 3D stacking (processor stacked on a DRAM cache), multi-chip packages (MCP)
- Benefits: significant reduction in miss rate; avoids the bandwidth wall
[Figure: impact of a large DRAM cache on major server workloads -- miss rate (0-80%) vs. per-thread size (MB) for SPECjbb, TPC-C, SPECjAppServer and SAP.]
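A common way to reason about the miss-rate chart is a power-law fit (miss rate proportional to capacity^-alpha, with alpha near 0.5, the so-called square-root rule). The baseline rate and exponent here are assumptions chosen only to illustrate why multi-megabyte-per-thread DRAM caches cut miss rates so sharply; they are not fits to the slide's workloads.

```python
def miss_rate(capacity_mb, base_rate=0.70, base_mb=1.0, alpha=0.5):
    """Power-law cache model: miss rate shrinks as capacity^-alpha."""
    return base_rate * (base_mb / capacity_mb) ** alpha

# Growing a 1 MB/thread cache to a 64 MB/thread DRAM cache divides the
# modeled miss rate by sqrt(64) = 8.
ratio = miss_rate(1.0) / miss_rate(64.0)
```

Because the curve flattens, each doubling of capacity helps less, which is why jumping to DRAM-class capacities (rather than incrementally larger SRAM) is what moves the needle.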
13 QoS and Performance Management
14 Background for QoS Discussion
Trends:
- Increasing core count for performance
- Increasing workload diversity: multi-workload scenarios in the client, virtualization and consolidation in the server, heterogeneous architectures for graphics
Observations:
- Multi-core enables simultaneous execution of multiple workloads
- But not all workloads are equal -- users do have preferences
- How well does the user-preferred application run? Should platforms optimize for the user-preferred application?
15 Resource Management
- Capitalist: no management of resources; if you can generate more requests, you will use more resources; grab as you will. E.g. all of today's policies.
- Elitist: focus on individual efficiency; provide more performance and resources to the VIP, limited resources to non-VIPs. E.g. service level agreements, foreground/background.
- Communist/Fair: fair distribution of resources; give an equal share to all executing threads, which does not necessarily guarantee equal performance. E.g. partitioning resources for fairness and isolation.
- Utilitarian: focus on overall efficiency; give more resource to those that need it the most, less to others. E.g. cache-friendly vs. unfriendly applications, resource-aware scheduling.
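The first three philosophies can be sketched as cache-way allocators. The demands and priorities are illustrative inputs; a utilitarian allocator would additionally need per-application marginal-utility curves, so it is omitted from this sketch.

```python
def allocate(total_ways, demand, priority=None, policy="communist"):
    """Divide cache ways among apps. demand: app -> requested ways;
    priority (elitist only): app -> rank, lower = more important."""
    apps = sorted(demand)
    if policy == "capitalist":
        # Grab as you will: shares simply track offered demand.
        total = sum(demand.values())
        return {a: total_ways * demand[a] // total for a in apps}
    if policy == "communist":
        # Equal shares regardless of demand; remainder to the first apps.
        base, extra = divmod(total_ways, len(apps))
        return {a: base + (1 if i < extra else 0) for i, a in enumerate(apps)}
    if policy == "elitist":
        # Serve VIPs in rank order until the cache runs out.
        left, out = total_ways, {}
        for a in sorted(apps, key=lambda a: priority[a]):
            out[a] = min(demand[a], left)
            left -= out[a]
        return out
    raise ValueError(f"unknown policy: {policy}")
```

The same 16 ways split very differently under each policy: a heavy streamer gets 12 ways under the capitalist policy but only its fair 8 under the communist one, which is exactly the tension the slide is naming.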
16 The Multi-Workload Problem
A foreground office application runs alongside a background image-recognition application, sharing cores, cache and memory (Conroe Core 2 Duo measurements).
[Figure: execution time of the foreground application running alone vs. running with the background application -- a significant response-time slowdown of ~50%.]
- The platform does not distinguish between applications in resource allocation
- The preferred (foreground) application can suffer significant slowdown
- Platform QoS can improve user-preferred application performance
17 Contention - Client-Side Examples (Performance Differentiation)
[Figure: runtime slowdown (1x-5x) for SPEC pairs run together, measured on a dual-core Conroe (4M cache, DDR2 667): ~30% of pairs exhibit 20%-5X slowdown, ~20% of pairs exhibit 10-20% slowdown, and the rest exhibit <10% slowdown. Benchmarks include mcf, art, equake, bzip2, wupwise, swim, eon, ammp, applu, apsi, crafty, facerec, fma3d, galgel, gap, gcc, gzip, lucas, mesa, mgrid, parser, perlbmk, sixtrack, twolf, vortex and vpr.]
Similar data has been collected for server applications.
Resource contention will impair the performance of important apps.
18 Application Behavior & Overall Performance (Performance Management)
With many heterogeneous applications, sharing behavior varies:
- Destructive (streaming) threads cause interference
- Normal (typical) threads
- Constructive (co-operating) threads share data
- Neutral (little data) threads
Monitoring resource usage and grouping/partitioning accordingly offers the potential to improve overall performance.
[Figure: example benefits -- performance loss in response time (0-300%) for Windows scheduling, which has no understanding of resource usage behavior, vs. resource-aware scheduling, in Clovertown (8 cores / 8 apps) experiments mixing destructive, normal and neutral apps: swim, eon, gcc, mcf, twolf, gzip, equake, gap.]
Managing resource contention can improve overall throughput too.
19 Service Level Agreements in the Enterprise (Need for Performance Isolation and SLA Enforcement)
Server consolidation places many different virtual machines -- e.g. an OLTP VM, a middle-tier VM and network apps (App1, App2, App3) -- on shared cores under a utility SLA, with SLA management assigning priorities (App1 Hi, App2 Med, App3 Lo).
- Disparate resource usage can cause performance isolation concerns
- Disparate resource usage and contention hurt SLAs
- Prioritized resource allocation can help address performance isolation and SLA enforcement
20 Shared Platform Resources
High-priority and low-priority workloads (apps/VMs on an OS/VMM) contend for:
- OS-visible (managed) resources: hardware threads (T1-T4) on cores
- OS-invisible (unmanaged) resources: shared microarchitectural resources, shared cache space, memory bandwidth and latency, and coarse-grain platform power management
21 Problem Summary
- CMP: many heterogeneous threads, apps and VMs
- Not all applications are equal; users have preferences: end-users (client) want to treat the foreground preferentially, and end-users (server) want service-level differentiation (SLA) or isolation
- Applications use resources differently (destructive vs. constructive vs. neutral threads), with no performance management to protect from bad behavior
- Priority-based OS scheduling is no longer sufficient: with more cores, the OS will allow high- and low-priority applications to run simultaneously and contend for resources, so low-priority applications will steal platform resources from high-priority apps -- a loss in performance and user experience
- The platform has no support for application differentiation: no knowledge of preferences or resource usage, and no support for fine-grained tracking of the many shared resources that have significant performance value
22 Platform QoS
The software domain supplies QoS hints via an architectural interface; the hardware domain applies HW policies through resource monitoring, QoS exposure and resource enforcement, feeding results back to software.
QoS-enabled resources: CPU core (uarch resource usage), cache space, memory bandwidth and latency, I/O response time, power (voltage/frequency).
Goals: preferential treatment of the VIP, better overall throughput.
23 Visible QoS Spectrum (Cache/Memory)
- No QoS: no monitoring, no enforcement across apps 1-3; no determinism, but no complexity; ~good resource usage due to the greedy approach, okay throughput.
- Classes of Service: monitoring per app, enforcement per class (Class 1, Class 2); incremental complexity that bridges the gap; an extensible architecture satisfying the requirements for performance management and performance differentiation.
- Guarantee per App: full monitoring and full enforcement per app (e.g. 2MB, 6GB/s) via per-app partitions; significant complexity, ~lower resource usage due to per-VM guarantees, ~poorer throughput.
24 QoS-Aware Architecture - Cache
- Set the application's platform priority (high vs. low): a QoS-aware OS/VMM adds the platform priority to the application state
- Platform QoS exposure: the priority is exposed through a Platform QoS Register (PQR), and requests from each core (Core 0, Core 1) are tagged with it
- Resource monitoring: monitor cache utilization per application
- Resource enforcement: enforce cache utilization per priority level (Hi/Lo)
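The tag-and-enforce flow on this slide can be mimicked in a few lines: each request carries its class of service (as a PQR would supply), and the cache restricts each class to a subset of its ways, so a low-priority streamer cannot evict high-priority data. The class names and way masks below are hypothetical, not a real hardware interface.

```python
class QoSCache:
    """Toy cache with per-class way masks enforcing cache QoS."""

    def __init__(self, way_mask_per_class):
        self.mask = way_mask_per_class  # class of service -> way indices
        self.lines = {}                 # way index -> resident tag

    def insert(self, tag, clos):
        allowed = list(self.mask[clos])
        for w in allowed:               # prefer an empty allowed way
            if w not in self.lines:
                self.lines[w] = tag
                return w
        victim = allowed[0]             # else evict only within own ways
        self.lines[victim] = tag
        return victim

# Hi may fill all 8 ways; Lo is confined to ways 6-7.
cache = QoSCache({"hi": range(8), "lo": range(6, 8)})
```

However many low-priority lines stream through, they only ever recycle ways 6 and 7, which is the performance-isolation effect the enforcement stage is meant to provide.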
25 Cache/Memory QoS Benefits - Client SPEC Case Study
[Figure: normalized response time (0-6x, lower is better) for art (Hi) + swim (Lo) and mcf (Hi) + swim (Lo) under: unmanaged sharing; memory QoS (bandwidth reservation for Hi); cache QoS (Lo restricted to 20% of the cache); cache + memory QoS; and dedicated resources (best case).]
Based on measurements, simulations and analytical projections: significant benefits from cache/memory QoS.
26 QoS-Aware Architecture: Power
- Power management knobs: voltage/frequency scaling, issue restriction, gating
- QoS monitoring spans cache, memory and power, with application priorities carried in the Platform QoS Register, so high-priority work runs on a fast core and low-priority work on a slow core
- Resource enforcement #1: victimize low priority for power QoS
- Resource enforcement #2: use power throttling to enforce performance QoS
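Resource enforcement #2 amounts to a feedback loop: step the low-priority core's frequency down until the high-priority application's measured slowdown meets a target. The 1.00, 0.89 and 0.39 steps match the frequency fractions quoted on the next slide; the intermediate steps, the target, and the contention model are illustrative assumptions.

```python
# Frequency fractions; 1.00, 0.89 and 0.39 come from the Power QoS
# Benefits data, the rest are assumed intermediate steps.
FREQ_STEPS = [1.00, 0.89, 0.75, 0.50, 0.39]

def throttle_low_priority(hi_slowdown_at, target=1.10):
    """Step the low-priority core's frequency down until the measured
    high-priority slowdown meets the target, then stay there."""
    for f in FREQ_STEPS:
        if hi_slowdown_at(f) <= target:
            return f
    return FREQ_STEPS[-1]  # floor: fully throttled is the best we can do

# Placeholder contention model: high-priority slowdown shrinks as the
# low-priority core slows (real hardware would read perf counters).
model = lambda f: 1.0 + 0.5 * f
```

A loose target leaves the low-priority core running faster; a tight target drives it to the floor, trading background throughput for foreground responsiveness exactly as the slide describes.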
27 Power QoS Benefits
CPU clock throttling of the low-priority core, with swim as the low-priority application:
[Figure: user-preferred application slowdown (1x-5x) across the same SPEC benchmarks as the contention study (mcf, art, equake, vpr, ammp, facerec, bzip2, lucas, vortex, galgel, wupwise, applu, parser, mgrid, twolf, gcc, fma3d, gap, perlbmk, crafty, apsi, mesa, sixtrack, gzip, eon) with the low-priority core at 100%, 89% and 39% of maximum frequency.]
- Maximum frequency (100%) for low priority => maximum slowdown of the high-priority application
- Reduced frequency (89%) for low priority => reduced slowdown for high priority
- Low frequency (39%) for low priority => minimal slowdown for high priority
Power QoS improves the performance of the user-preferred application.
28 Virtualization: From VMs to VPAs (Managing Transparent Resources)
- A Virtual Machine Monitor manages CPU and memory capacity: workloads on guest OSes run in virtual resource containers with CPU and memory capacity allocations
- A Virtual Platform Architecture additionally manages performance with platform QoS (PQoS): cache/memory allocation (capacity and bandwidth) and power management (voltage/frequency, instruction and issue throttling, average power)
Performance isolation and SLA differentiation motivate Virtual Platform Architectures.
29 Summary
- Large-scale CMP is going to happen; lots of work remains to identify and remove the platform and architectural limitations preventing applications and execution environments from scaling up to 100s of logical processors on a single CPU die
- Scalability concerns can be addressed: a hierarchy of shared caches, large DRAM caches
- QoS concerns can be addressed: dynamic cache allocation (cache QoS), dynamic power management (power QoS)
- Smart performance management requires that more visibility be available to the execution environment: resource utilization counters for schedulers, etc.
More informationPerformance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware
Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware 2010 VMware Inc. All rights reserved About the Speaker Hemant Gaidhani Senior Technical
More informationWhat s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1
What s New in VMware vsphere 4.1 Performance VMware vsphere 4.1 T E C H N I C A L W H I T E P A P E R Table of Contents Scalability enhancements....................................................................
More informationIntel Enterprise Processors Technology
Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology
More informationDual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window
Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era
More informationVIRTUALIZING SERVER CONNECTIVITY IN THE CLOUD
VIRTUALIZING SERVER CONNECTIVITY IN THE CLOUD Truls Myklebust Director, Product Management Brocade Communications 2011 Brocade Communciations - All Rights Reserved 13 October 2011 THE ENTERPRISE IS GOING
More informationLecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)
Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling
More informationBalanced Cache: Reducing Conflict Misses of Direct-Mapped Caches through Programmable Decoders
Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches through Programmable Decoders Chuanjun Zhang Department of Computer Science and Electrical Engineering University of Missouri-Kansas City
More informationInstruction Based Memory Distance Analysis and its Application to Optimization
Instruction Based Memory Distance Analysis and its Application to Optimization Changpeng Fang cfang@mtu.edu Steve Carr carr@mtu.edu Soner Önder soner@mtu.edu Department of Computer Science Michigan Technological
More informationAre we ready for high-mlp?
Are we ready for high-mlp?, James Tuck, Josep Torrellas Why MLP? overlapping long latency misses is a very effective way of tolerating memory latency hide latency at the cost of bandwidth 2 Motivation
More informationAn Approach for Adaptive DRAM Temperature and Power Management
An Approach for Adaptive DRAM Temperature and Power Management Song Liu, Seda Ogrenci Memik, Yu Zhang, and Gokhan Memik Department of Electrical Engineering and Computer Science Northwestern University,
More informationPerformance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference
The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee
More informationExploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer
More informationEvolution of Rack Scale Architecture Storage
Evolution of Rack Scale Architecture Storage Murugasamy (Sammy) Nachimuthu, Principal Engineer Mohan J Kumar, Fellow Intel Corporation August 2016 1 Agenda Introduction to Intel Rack Scale Design Storage
More informationVirtual Private Caches
Kyle J. Nesbit, James Laudon *, and James E. Smith University of Wisconsin Madison Departmnet of Electrical and Computer Engr. { nesbit, jes }@ece.wisc.edu Sun Microsystems, Inc. * James.laudon@sun.com
More informationThe Software Driven Datacenter
The Software Driven Datacenter Three Major Trends are Driving the Evolution of the Datacenter Hardware Costs Innovation in CPU and Memory. 10000 10 µm CPU process technologies $100 DRAM $/GB 1000 1 µm
More informationUnderstanding Bulldozer architecture through Linpack benchmark
Understanding Bulldozer architecture through Linpack benchmark By Joshua.Mora@amd.com Abstract: AMD has recently introduced a new core architecture named Bulldozer in the newly released multicore processors
More information1.6 Computer Performance
1.6 Computer Performance Performance How do we measure performance? Define Metrics Benchmarking Choose programs to evaluate performance Performance summary Fallacies and Pitfalls How to avoid getting fooled
More informationCluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures
Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah ABSTRACT The growing dominance of wire delays at future technology
More informationPerformance, Cost and Amdahl s s Law. Arquitectura de Computadoras
Performance, Cost and Amdahl s s Law Arquitectura de Computadoras Arturo Díaz D PérezP Centro de Investigación n y de Estudios Avanzados del IPN adiaz@cinvestav.mx Arquitectura de Computadoras Performance-
More informationWaveScalar. Winter 2006 CSE WaveScalar 1
WaveScalar Dataflow machine good at exploiting ILP dataflow parallelism traditional coarser-grain parallelism cheap thread management memory ordering enforced through wave-ordered memory Winter 2006 CSE
More informationCSC501 Operating Systems Principles. OS Structure
CSC501 Operating Systems Principles OS Structure 1 Announcements q TA s office hour has changed Q Thursday 1:30pm 3:00pm, MRC-409C Q Or email: awang@ncsu.edu q From department: No audit allowed 2 Last
More informationPerformance Oriented Prefetching Enhancements Using Commit Stalls
Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationSpeculative Multithreaded Processors
Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads
More informationExploring Wakeup-Free Instruction Scheduling
Exploring Wakeup-Free Instruction Scheduling Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin Microsystems Design Lab The Pennsylvania State University Outline Motivation Case study: Cyclone Towards high-performance
More informationCache Insertion Policies to Reduce Bus Traffic and Cache Conflicts
Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts Yoav Etsion Dror G. Feitelson School of Computer Science and Engineering The Hebrew University of Jerusalem 14 Jerusalem, Israel Abstract
More informationSWAP: EFFECTIVE FINE-GRAIN MANAGEMENT
: EFFECTIVE FINE-GRAIN MANAGEMENT OF SHARED LAST-LEVEL CACHES WITH MINIMUM HARDWARE SUPPORT Xiaodong Wang, Shuang Chen, Jeff Setter, and José F. Martínez Computer Systems Lab Cornell University Page 1
More informationResource Containers. A new facility for resource management in server systems. Presented by Uday Ananth. G. Banga, P. Druschel, J. C.
Resource Containers A new facility for resource management in server systems G. Banga, P. Druschel, J. C. Mogul OSDI 1999 Presented by Uday Ananth Lessons in history.. Web servers have become predominantly
More informationAB-Aware: Application Behavior Aware Management of Shared Last Level Caches
AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering
More informationibench: Quantifying Interference in Datacenter Applications
ibench: Quantifying Interference in Datacenter Applications Christina Delimitrou and Christos Kozyrakis Stanford University IISWC September 23 th 2013 Executive Summary Problem: Increasing utilization
More informationResource-Conscious Scheduling for Energy Efficiency on Multicore Processors
Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe
More informationfor High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami
3D Implemented dsram/dram HbidC Hybrid Cache Architecture t for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami Kyushu University
More informationmos: An Architecture for Extreme Scale Operating Systems
mos: An Architecture for Extreme Scale Operating Systems Robert W. Wisniewski, Todd Inglett, Pardo Keppel, Ravi Murty, Rolf Riesen Presented by: Robert W. Wisniewski Chief Software Architect Extreme Scale
More informationTowards Energy-Proportional Datacenter Memory with Mobile DRAM
Towards Energy-Proportional Datacenter Memory with Mobile DRAM Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University
More information