Workloads, Scalability and QoS Considerations in CMP Platforms

1 Workloads, Scalability and QoS Considerations in CMP Platforms
Presenter: Don Newell, Sr. Principal Engineer, Intel Corporation
2007 Intel Corporation

2 Agenda
- Trends and research context
- Evolving Workload Scenarios
- Platform Scalability
- Platform QoS
CTG/STL/MPA Review w/ DAP 2

3 Trends - Cores, Threads, and Virtual Machines
[Figure: three panels (Today, Tomorrow, Not too distant future) showing Apps running on OSes (OS 1 ... OS n) inside VMs (VM1 ... VMn) on top of a VMM and I/O.]
Lots of hardware threads and greater software diversity will challenge cache, memory and I/O

4 Trends - Discussion Points
- Goal: Scaling up to 100s of logical processors on a single CPU die
- Scaling hardware threads means provisioning and re-architecting other platform resources
- Scaling hardware threads means learning to share
- QoS revisited

5 Trends - Pool of Virtual Resources
[Figure: three Application Containers (e.g. Front-End, Mid-Tier, Back-end), each running Workloads on a Guest OS inside a Virtual Resource Container of CPUs; below them, Processor Virtualization support and Processor Resources (Many cores).]
Virtual Machine Monitor: Creates, Deletes and Manages Dynamic Application Containers

6 Trends - Decomposed OS
[Figure: Containers (i.e. OLTP) holding a Micro OS, OS Kernel Service, Device Driver, OS Service Thread, Application Thread, and a Legacy OS, joined by an Application Messaging interface; below them, a Hypervisor with Processor Virtualization, Processor Resources (Many cores), and HW Resources.]

7 Platform Scaling

8 Tera-Scale Workload Scenarios
[Figure: a single Server App on an SMP O/S on a Tera-Scale Platform, and several Server Apps on SMP O/Ses over a VMM on a Tera-Scale Platform.]
- Highly Multi-threaded Server Workloads: OLTP, J2EE App Servers, E-commerce, ERP, Search, etc. (TPC-C, SPECjbb2005, SAP, SPECjappserver2004, etc.)
- Virtualized Server Environments (Workload Consolidation, Datacenter-on-chip Scenarios)
Considerations:
- Performance => Overall Throughput
- Scalability => Headroom
- Quality of Service => Performance Isolation
Let's begin with the single server app, and follow up with the consolidated scenario.

9 Tera-Scale Hierarchy Design
[Figure: Nodes 0 to N-1, each with cores C0-C3, per-core L1s and a Mid-Level cache, connected by an On-Die Interconnect to LLC 0 ... LLC N-1.]
- Small L1 shared between multiple threads on a core
- Moderate L2 shared between cores within the node
- Distributed L3 shared by all nodes within the socket
Hierarchical Sharing Benefits
-- Reduces Replication
-- Increases Effective Size
-- Better localized communication
-- Reduced interconnect pressure
[Chart: MLC Performance (datampi, codempi) at sizes 128K, 256K, 512K, 512K, 1M, 2M; series: 1 hardware thread - Private, 4 hardware threads - Shared.]
Shared MLC is equivalent to 2X of private MLCs
Hierarchical sharing increases caching effectiveness significantly

10 OLTP Tera-Scale Case Study
[Chart: configurations labeled Core / Node, Perfect Last-Level, Single socket (32 cores/4 threads), and Multiple Socket; two points are marked OUCH.]
- 1 Large Core ~ 4 Small Cores
- Small core ~50% large core performance on server workloads => 2X Performance
- Potential memory bottleneck: need 100+ GB of memory B/W
- System Interconnect bottleneck: need 40+ GB of B/W
Significant Performance Potential => Scalability Challenges
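The 2X figure on this slide follows from simple arithmetic, sketched below. This is an illustrative calculation only: the function name and parameters are hypothetical, and it assumes the workload scales perfectly across the small cores and ignores the memory and interconnect bottlenecks the slide warns about.

```python
def small_core_throughput_ratio(small_per_large: int = 4,
                                small_relative_perf: float = 0.5) -> float:
    """Aggregate throughput of the small cores that replace one large core,
    relative to that large core (assumes perfect scaling across cores)."""
    return small_per_large * small_relative_perf


if __name__ == "__main__":
    # Slide values: ~4 small cores per large core, each ~50% of its performance.
    ratio = small_core_throughput_ratio()
    print(f"Throughput potential vs. one large core: {ratio:.1f}x")
```

With the slide's values the ratio is 4 x 0.5 = 2.0, which is the "2X Performance" potential before any bandwidth limit is applied.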

11 Scalability Issues
Tera-scale Headroom requirements
- Start with 32 cores in 1st gen; maybe grow to 48 cores in next gen?
- How much memory & interconnect bandwidth will we need?
[Figure: CPUs with LLCs attached to Mem and to the Sys Interconnect, with links labeled M and H.]
- M: 32C / skt: GB/s; 48C / skt: GB/s
- H: 32C / skt: ~40GB; 48C / skt: ~50-70GB
Scaling Issues Just Get Worse

12 On-Socket DRAM Caches (For Scalability)
Enable Large Capacity L4s
- Low Latency
- High Bandwidth
Technologies
- 3D Stacking
- Multi-chip Packages (MCP)
Benefits
- Significant reduction in miss rate
- Avoids bandwidth wall
[Figure: 3D stack and MCP arrangements of Proc and DRAM $.]
[Chart: Impact of Large DRAM$ (Major Server Workloads). Y axis: Miss Rate (%), 0% to 80%; X axis: Per Thread Size (MB); series: SPECJBB, TPCC, SPECJAppServer, SAP.]

13 QoS and Performance Management

14 Background for QoS Discussion
Trends
- Increasing Core Count for Performance
- Increasing Workload Diversity
- Multi-Workload Scenarios in the Client
- Virtualization and Consolidation in the Server
- Heterogeneous Architectures for Graphics
Observations
- Multi-core enables simultaneous execution of multiple workloads
- But not all workloads are equal -- users do have preferences
- How well does the user-preferred application run?
- Should platforms optimize for the user-preferred application?

15 Resource Management
Capitalist
- No management of resources
- If you can generate more requests, you will use more resources
- Grab as you will
- E.g. All of today's policies
Elitist
- Focus on individual efficiency
- Provide more performance and resources to the VIP
- Limited resources to non-VIP
- E.g. Service Level Agreements, Foreground/Background
Communist/Fair
- Fair distribution of resources
- Give equal share of resources to all executing threads
- Does not necessarily guarantee equal performance
- E.g. Partitioning resources for fairness and isolation
Utilitarian
- Focus on overall efficiency
- Give more resource to those that need it the most, less to others
- E.g. Cache-friendly vs. unfriendly, resource-aware scheduling
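The four policies above can be contrasted with a small sketch that divides one shared resource (for example, cache ways) among threads. Everything here is an assumption made for illustration: the function names, the 75% VIP reservation, and the proportional-to-benefit rule for the utilitarian case are not taken from the slides.

```python
from typing import Dict


def capitalist(demand: Dict[str, float], total: float) -> Dict[str, float]:
    """No management: each thread's share follows its request rate."""
    total_demand = sum(demand.values())
    return {t: total * d / total_demand for t, d in demand.items()}


def elitist(demand: Dict[str, float], total: float, vip: str,
            vip_fraction: float = 0.75) -> Dict[str, float]:
    """Reserve a fixed fraction for the VIP; the rest is split equally."""
    others = [t for t in demand if t != vip]
    alloc = {vip: total * vip_fraction}
    for t in others:
        alloc[t] = total * (1.0 - vip_fraction) / len(others)
    return alloc


def fair(demand: Dict[str, float], total: float) -> Dict[str, float]:
    """Equal share for every executing thread, regardless of demand."""
    return {t: total / len(demand) for t in demand}


def utilitarian(benefit: Dict[str, float], total: float) -> Dict[str, float]:
    """Share in proportion to how much each thread benefits per unit of
    resource (a simplification of 'give more to those that need it most')."""
    total_benefit = sum(benefit.values())
    return {t: total * b / total_benefit for t, b in benefit.items()}


if __name__ == "__main__":
    ways = 16.0
    # Hypothetical request rates and per-way benefit estimates.
    demand = {"stream": 6.0, "oltp": 3.0, "office": 1.0}
    benefit = {"stream": 0.5, "oltp": 3.0, "office": 1.5}
    print("capitalist ", capitalist(demand, ways))
    print("elitist    ", elitist(demand, ways, vip="office"))
    print("fair       ", fair(demand, ways))
    print("utilitarian", utilitarian(benefit, ways))
```

Under the capitalist policy the streaming thread takes most of the resource simply because it issues the most requests, which is the behavior the later slides set out to correct.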

16 The Multi-Workload Problem
[Figure: a Foreground (F) Office Application and a Background (B) Image Recognition application running on two cores that share Mem and I/O.]
[Chart: Execution Time of a foreground application, Running Alone versus Running with background application; significant response time slowdown (50%). Conroe Core 2 Duo Measurements.]
- Platform does not distinguish in resource allocation
- Preferred (foreground) application can suffer significant slowdown
Platform QoS can improve user-preferred application performance

17 Contention - Client side examples (Performance Differentiation)
Corporate Client Example: SPEC Pairs / Contention
[Chart: Runtime Slowdown (1x to 5x) for pairs of SPEC benchmarks: mcf, art, equake, bzip2, wupwise, swim, eon, ammp, applu, apsi, crafty, facerec, fma3d, galgel, gap, gcc, gzip, lucas, mesa, mgrid, parser, perlbmk, sixtrack, twolf, vortex, vpr. Measured on a dual-core Conroe (4M cache, DDR2 667).]
- 30% exhibit 20%-5X slowdown
- 20% pairs 10-20% slowdown
- Rest exhibit <10% slowdown
- Similar data collected for server applications
Resource contention will impair the performance of important apps

18 Application Behavior & Overall Performance (Performance Management)
Many Heterogeneous Applications
- Destructive (streaming)
- Normal (typical)
- Constructive (co-op threads)
- Neutral (little data)
[Figure: application types (D1, No1, C1, C2, Ne1) placed on cores, illustrating Interference and Inefficient Sharing.]
Managing resource contention can improve overall throughput too
- Potential to improve overall performance
- Monitor resource usage and group/partition accordingly
[Chart: Example Benefits. Response Time Performance Loss (0% to 300%) for swim, eon, gcc, mcf, twolf, gzip, equake, gap under Windows Scheduling (no understanding of resource usage behavior) versus Resource-Aware Scheduling. Clovertown (8 Core / 8 App) Experiments (used destructive, normal and neutral apps).]
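One way to read "monitor resource usage and group/partition accordingly" is sketched below: rank applications by a monitored cache-pressure metric and co-locate the heaviest with the lightest on each pair of cores that share a cache. This is a hypothetical heuristic, not the scheduler used in the Clovertown experiments, and the pressure values are invented for illustration rather than measured.

```python
from typing import Dict, List, Tuple


def pair_by_pressure(pressure: Dict[str, float]) -> List[Tuple[str, str]]:
    """Pair the heaviest cache user with the lightest, the second heaviest
    with the second lightest, and so on. Expects an even number of apps;
    with an odd number, the middle app is left unpaired."""
    ranked = sorted(pressure, key=pressure.get, reverse=True)
    pairs = []
    while len(ranked) >= 2:
        heavy = ranked.pop(0)   # most destructive remaining app
        light = ranked.pop()    # most neutral remaining app
        pairs.append((heavy, light))
    return pairs


if __name__ == "__main__":
    # Hypothetical pressure scores (higher = more destructive), one per app.
    monitored = {
        "swim": 9.0, "mcf": 8.0, "equake": 7.0, "twolf": 4.0,
        "gcc": 3.5, "gap": 3.0, "gzip": 1.0, "eon": 0.5,
    }
    for shared_cache, pair in enumerate(pair_by_pressure(monitored)):
        print(f"shared cache {shared_cache}: {pair}")
```

The design choice here is to balance total pressure per shared cache; a different policy could instead group destructive applications together so that they only disturb each other.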

19 Service Level Agreements in the Enterprise (Need for Performance Isolation and SLA enforcement)
Server Consolidation
- Many Different Virtual Machines (e.g. OLTP VM, Middle Tier VM, Network)
- Disparate resource usage can cause performance isolation concerns
Utility SLA
- App1 Hi, App2 Med, App3 Lo
- SLA management: prioritized resource allocation can help address performance isolation and SLA
Disparate resource usage and contention hurt SLAs

20 Shared Platform Resources
[Figure: Hi Priority and Low Priority Workloads (App/VM on OS/VMM) sharing the platform. OS Visible (Managed Resources): cores and HW threads T1-T4. OS Invisible (Unmanaged Resources): shared core and microarchitectural (U-Arch) resources, Space, Mem BW & Latency, BW & Latency, and Coarse Grain Platform Power Management.]

21 Problem Summary
CMP: Many heterogeneous threads, apps, VMs
Not all applications are equal
- Users have preferences
  - End-users (client) want to treat foreground preferentially
  - End-users (server) want service level differentiation (SLA) or isolation
- Applications use resources differently
  - Destructive vs. Constructive vs. Neutral Threads
  - No performance management to protect from bad behavior
Priority-based OS scheduling no longer sufficient
- With more cores, OS will allow high and low priority applications to run simultaneously and contend for resources
- Low priority applications will steal platform resources from high priority apps => loss in performance & user experience
Platform has no support for application differentiation
- Platform has no knowledge of preferences or resource usage
- Platform has no support for fine-grained tracking of many shared resources that have significant performance value

22 Platform QoS
Goals: Preferential Treatment of VIP, Better Overall Throughput
[Figure: Software Domain and Hardware Domain connected through an Architectural Interface. QoS Hints and QoS Exposure feed HW Policies and Resource Enforcement; Resource Monitoring provides Feedback.]
QoS Enabled Resources: CPU Core, Power, uarch resource usage, Space, Bandwidth and Latency, I/O Response Time, Voltage/Frequency

23 Visible QoS Spectrum
No QoS (App 1, App 2, App 3)
- No monitoring
- No enforcement
- No determinism
- No complexity
- ~ Good resource usage due to greedy approach
- ~ Okay throughput
Possible Approach: Classes of Service (Class 1, Class 2)
- Monitoring Per App*
- Enforcement Per Class
- Incremental Complexity
- Bridges the gap
- Extensible Architecture
- Satisfies Requirements for perf. management, perf. differentiation
Per-App Partitions (App 1, App 2, App 3; e.g. 2MB, 6GB/s)
- Full monitoring Per App
- Full enforcement Per App
- Guarantee per App
- Significant Complexity
- ~ Lower resource usage due to per VM guarantee
- ~ Poorer throughput

24 QoS Aware Architecture - Cache
[Figure: High Priority and Low Priority applications, each with an Application Platform Priority held in App State by the OS; a QoS Interface to the Platform QoS Register (PQR); Core 0 and Core 1 issuing requests tagged Hi and Lo.]
- QoS Aware OS/VMM: Set Application's Platform Priority; Priority added to App state
- Platform QoS Exposure: Expose QoS Interface; Platform Priority sent through PQR (Platform QoS Register); Requests tagged with Priority
- Resource Monitoring: Monitor cache utilization per application
- Resource Enforcement: Enforce cache utilization for priority levels
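The flow on this slide (the OS writes a priority into the PQR, requests carry that tag, the cache monitors and enforces occupancy per priority) can be modeled in a few lines. This is a toy software model under stated assumptions, not the hardware design: the class and method names are hypothetical, eviction is FIFO within a priority level, and the 20% cap on low priority is borrowed from the "Lo restricted to 20%" case in the case-study slide.

```python
from collections import Counter

HI, LO = "hi", "lo"


class SharedCache:
    """Toy shared cache that monitors and caps occupancy per priority tag."""

    def __init__(self, capacity_lines: int, lo_limit_fraction: float = 0.2):
        self.capacity = capacity_lines
        self.lo_limit = max(1, int(capacity_lines * lo_limit_fraction))
        self.lines = {}              # address -> priority tag that filled it
        self.occupancy = Counter()   # resource monitoring: lines per priority

    def access(self, address: int, priority: str) -> bool:
        """Return True on a hit. On a miss, fill the line under enforcement."""
        if address in self.lines:
            return True
        if priority == LO and self.occupancy[LO] >= self.lo_limit:
            self._evict(LO)          # low priority may only replace itself
        elif len(self.lines) >= self.capacity:
            self._evict(LO if self.occupancy[LO] > 0 else HI)
        self.lines[address] = priority
        self.occupancy[priority] += 1
        return False

    def _evict(self, priority: str) -> None:
        # Oldest line with this tag (dicts keep insertion order): FIFO.
        victim = next(a for a, p in self.lines.items() if p == priority)
        del self.lines[victim]
        self.occupancy[priority] -= 1


class Core:
    """Core with a Platform QoS Register (PQR) that tags its requests."""

    def __init__(self, cache: SharedCache):
        self.cache = cache
        self.pqr = LO

    def context_switch(self, app_priority: str) -> None:
        self.pqr = app_priority      # OS restores priority from app state

    def load(self, address: int) -> bool:
        return self.cache.access(address, self.pqr)


def demo():
    cache = SharedCache(capacity_lines=10)
    core0, core1 = Core(cache), Core(cache)
    core0.context_switch(HI)
    core1.context_switch(LO)
    for a in range(8):               # high-priority working set
        core0.load(a)
    for a in range(1000, 1040):      # low-priority streaming accesses
        core1.load(a)
    hits = [core0.load(a) for a in range(8)]
    return cache.occupancy[HI], cache.occupancy[LO], all(hits)


if __name__ == "__main__":
    print(demo())
```

In this model the streaming low-priority application never holds more than its capped share, so the high-priority working set stays resident and hits on reuse.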

25 Cache/Memory QoS Benefits
Client SPEC Case Study
[Chart: QoS Potential. Y axis: Response Time (Normalized), 0.0x to 6.0x, lower is better. Workload pairs: Art (Hi) with Swim (Lo), and mcf (Hi) with Swim (Lo). Series: Unmanaged Sharing; QoS (BW reservation for Hi); QoS (Lo restricted to 20%); + Mem QoS; Dedicated (Best Case).]
Based on Measurements, Simulations and Analytical Projections
Significant Benefits of Cache/Memory QoS

26 QoS Aware Architecture: Power
[Figure: High Priority and Low Priority applications with Application Platform Priority in App State, the OS QoS Interface and Platform QoS Register, and a fast Core 0 (Hi) and slow Core 1 (Lo) under QoS Monitoring.]
Power management knobs
- Voltage/Frequency scaling
- Issue restriction
- Gating
Resource Enforcement #1: Victimize low priority for Power QoS
Resource Enforcement #2: Use power throttling to enforce Performance QoS

27 Power QoS Benefits
Performance impact with CPU clock throttling (Swim as low priority application)
[Figure: High Priority application on a fast core at 100% frequency; Low Priority application on a slow core at 100%, 89%, or 39% frequency; both share Platform Resources.]
[Chart: User-Preferred App Slowdown (1x to 5x) for mcf, art, equake, vpr, ammp, facerec, bzip2, lucas, vortex, galgel, wupwise, applu, swim, parser, mgrid, twolf, gcc, fma3d, gap, perlbmk, crafty, apsi, mesa, sixtrack, gzip, eon, with the low priority core at 100%, 89%, and 39% frequency.]
- Max frequency (100%) to low priority -> Max slowdown to High Priority Application
- Reduced frequency (89%) for low priority -> Reduced slowdown for Hi Priority
- Low frequency (39%) to low priority -> Minimal slowdown for Hi Priority
Improves performance of the user-preferred application

28 Virtualization: From VMs to VPAs (Managing Transparent Resources)
[Figure: Virtual Machines, each running Workloads on a Guest OS in a Virtual Resource Container (CPU, Capacity), shown alongside Virtual Platform Architecture containers that add B/W, Capacity, and Average Power; below them, Processor Virtualization support, Allocation, Power Management (V/f, Instruction & Issue Throttling), and Processor Resources (Many cores).]
- Virtual Machine Monitor manages CPU and memory capacity
- Virtual Platform Architectures manage performance with PQoS
- Performance Isolation / SLA Differentiation motivates Virtual Platform Architectures

29 Summary
- Large-scale CMP is going to happen
  - Lots of work to be done to identify and remove platform and architectural limitations preventing applications and execution environments from scaling up to 100s of logical processors on a single CPU die
- Scalability concerns can be addressed
  - Hierarchy of Shared Caches
  - Large DRAM caches
- QoS concerns can be addressed
  - Dynamic Allocation (QoS)
  - Dynamic Power Management (Power QoS)
- Smart performance management requires more visibility be available to the execution environment
  - Resource utilization counters for schedulers, etc.


More information

Architecture Cloning For PowerPC Processors. Edwin Chan, Raul Silvera, Roch Archambault IBM Toronto Lab Oct 17 th, 2005

Architecture Cloning For PowerPC Processors. Edwin Chan, Raul Silvera, Roch Archambault IBM Toronto Lab Oct 17 th, 2005 Architecture Cloning For PowerPC Processors Edwin Chan, Raul Silvera, Roch Archambault edwinc@ca.ibm.com IBM Toronto Lab Oct 17 th, 2005 Outline Motivation Implementation Details Results Scenario Previously,

More information

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012 Energy Proportional Datacenter Memory Brian Neel EE6633 Fall 2012 Outline Background Motivation Related work DRAM properties Designs References Background The Datacenter as a Computer Luiz André Barroso

More information

Precise Instruction Scheduling

Precise Instruction Scheduling Journal of Instruction-Level Parallelism 7 (2005) 1-29 Submitted 10/2004; published 04/2005 Precise Instruction Scheduling Gokhan Memik Department of Electrical and Computer Engineering Northwestern University

More information

NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory

NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory Dhananjoy Das, Sr. Systems Architect SanDisk Corp. 1 Agenda: Applications are KING! Storage landscape (Flash / NVM)

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Computer Science 246. Computer Architecture

Computer Science 246. Computer Architecture Computer Architecture Spring 2010 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture Outline Performance Metrics Averaging Amdahl s Law Benchmarks The CPU Performance Equation Optimal

More information

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi Chang Joo Lee Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The

More information

HASS: A Scheduler for Heterogeneous Multicore Systems

HASS: A Scheduler for Heterogeneous Multicore Systems HASS: A Scheduler for Heterogeneous Multicore Systems Daniel Shelepov 1 dsa5@cs.sfu.ca Alexandra Fedorova 1 fedorova@cs.sfu.ca Sergey Blagodurov 1 sba70@cs.sfu.ca 1 Simon Fraser University 8888 University

More information

Many Cores, One Thread: Dean Tullsen University of California, San Diego

Many Cores, One Thread: Dean Tullsen University of California, San Diego Many Cores, One Thread: The Search for Nontraditional Parallelism University of California, San Diego There are some domains that feature nearly unlimited parallelism. Others, not so much Moore s Law and

More information

Best Practices for Setting BIOS Parameters for Performance

Best Practices for Setting BIOS Parameters for Performance White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page

More information

Multicore Computing and the Cloud Optimizing Systems with Virtualization

Multicore Computing and the Cloud Optimizing Systems with Virtualization Multicore Computing and the Cloud Optimizing Systems with Virtualization Michael Gschwind IBM Systems and Technology Group IBM Corporation Major Forces Are Driving the Need For IT Transformation Operational

More information

Software-assisted Cache Mechanisms for Embedded Systems. Prabhat Jain

Software-assisted Cache Mechanisms for Embedded Systems. Prabhat Jain Software-assisted Cache Mechanisms for Embedded Systems by Prabhat Jain Bachelor of Engineering in Computer Engineering Devi Ahilya University, 1986 Master of Technology in Computer and Information Technology

More information

Distributed File Systems Issues. NFS (Network File System) AFS: Namespace. The Andrew File System (AFS) Operating Systems 11/19/2012 CSC 256/456 1

Distributed File Systems Issues. NFS (Network File System) AFS: Namespace. The Andrew File System (AFS) Operating Systems 11/19/2012 CSC 256/456 1 Distributed File Systems Issues NFS (Network File System) Naming and transparency (location transparency versus location independence) Host:local-name Attach remote directories (mount) Single global name

More information

Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware

Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware 2010 VMware Inc. All rights reserved About the Speaker Hemant Gaidhani Senior Technical

More information

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1 What s New in VMware vsphere 4.1 Performance VMware vsphere 4.1 T E C H N I C A L W H I T E P A P E R Table of Contents Scalability enhancements....................................................................

More information

Intel Enterprise Processors Technology

Intel Enterprise Processors Technology Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

VIRTUALIZING SERVER CONNECTIVITY IN THE CLOUD

VIRTUALIZING SERVER CONNECTIVITY IN THE CLOUD VIRTUALIZING SERVER CONNECTIVITY IN THE CLOUD Truls Myklebust Director, Product Management Brocade Communications 2011 Brocade Communciations - All Rights Reserved 13 October 2011 THE ENTERPRISE IS GOING

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches through Programmable Decoders

Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches through Programmable Decoders Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches through Programmable Decoders Chuanjun Zhang Department of Computer Science and Electrical Engineering University of Missouri-Kansas City

More information

Instruction Based Memory Distance Analysis and its Application to Optimization

Instruction Based Memory Distance Analysis and its Application to Optimization Instruction Based Memory Distance Analysis and its Application to Optimization Changpeng Fang cfang@mtu.edu Steve Carr carr@mtu.edu Soner Önder soner@mtu.edu Department of Computer Science Michigan Technological

More information

Are we ready for high-mlp?

Are we ready for high-mlp? Are we ready for high-mlp?, James Tuck, Josep Torrellas Why MLP? overlapping long latency misses is a very effective way of tolerating memory latency hide latency at the cost of bandwidth 2 Motivation

More information

An Approach for Adaptive DRAM Temperature and Power Management

An Approach for Adaptive DRAM Temperature and Power Management An Approach for Adaptive DRAM Temperature and Power Management Song Liu, Seda Ogrenci Memik, Yu Zhang, and Gokhan Memik Department of Electrical Engineering and Computer Science Northwestern University,

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer

More information

Evolution of Rack Scale Architecture Storage

Evolution of Rack Scale Architecture Storage Evolution of Rack Scale Architecture Storage Murugasamy (Sammy) Nachimuthu, Principal Engineer Mohan J Kumar, Fellow Intel Corporation August 2016 1 Agenda Introduction to Intel Rack Scale Design Storage

More information

Virtual Private Caches

Virtual Private Caches Kyle J. Nesbit, James Laudon *, and James E. Smith University of Wisconsin Madison Departmnet of Electrical and Computer Engr. { nesbit, jes }@ece.wisc.edu Sun Microsystems, Inc. * James.laudon@sun.com

More information

The Software Driven Datacenter

The Software Driven Datacenter The Software Driven Datacenter Three Major Trends are Driving the Evolution of the Datacenter Hardware Costs Innovation in CPU and Memory. 10000 10 µm CPU process technologies $100 DRAM $/GB 1000 1 µm

More information

Understanding Bulldozer architecture through Linpack benchmark

Understanding Bulldozer architecture through Linpack benchmark Understanding Bulldozer architecture through Linpack benchmark By Joshua.Mora@amd.com Abstract: AMD has recently introduced a new core architecture named Bulldozer in the newly released multicore processors

More information

1.6 Computer Performance

1.6 Computer Performance 1.6 Computer Performance Performance How do we measure performance? Define Metrics Benchmarking Choose programs to evaluate performance Performance summary Fallacies and Pitfalls How to avoid getting fooled

More information

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah ABSTRACT The growing dominance of wire delays at future technology

More information

Performance, Cost and Amdahl s s Law. Arquitectura de Computadoras

Performance, Cost and Amdahl s s Law. Arquitectura de Computadoras Performance, Cost and Amdahl s s Law Arquitectura de Computadoras Arturo Díaz D PérezP Centro de Investigación n y de Estudios Avanzados del IPN adiaz@cinvestav.mx Arquitectura de Computadoras Performance-

More information

WaveScalar. Winter 2006 CSE WaveScalar 1

WaveScalar. Winter 2006 CSE WaveScalar 1 WaveScalar Dataflow machine good at exploiting ILP dataflow parallelism traditional coarser-grain parallelism cheap thread management memory ordering enforced through wave-ordered memory Winter 2006 CSE

More information

CSC501 Operating Systems Principles. OS Structure

CSC501 Operating Systems Principles. OS Structure CSC501 Operating Systems Principles OS Structure 1 Announcements q TA s office hour has changed Q Thursday 1:30pm 3:00pm, MRC-409C Q Or email: awang@ncsu.edu q From department: No audit allowed 2 Last

More information

Performance Oriented Prefetching Enhancements Using Commit Stalls

Performance Oriented Prefetching Enhancements Using Commit Stalls Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

Exploring Wakeup-Free Instruction Scheduling

Exploring Wakeup-Free Instruction Scheduling Exploring Wakeup-Free Instruction Scheduling Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin Microsystems Design Lab The Pennsylvania State University Outline Motivation Case study: Cyclone Towards high-performance

More information

Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts

Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts Yoav Etsion Dror G. Feitelson School of Computer Science and Engineering The Hebrew University of Jerusalem 14 Jerusalem, Israel Abstract

More information

SWAP: EFFECTIVE FINE-GRAIN MANAGEMENT

SWAP: EFFECTIVE FINE-GRAIN MANAGEMENT : EFFECTIVE FINE-GRAIN MANAGEMENT OF SHARED LAST-LEVEL CACHES WITH MINIMUM HARDWARE SUPPORT Xiaodong Wang, Shuang Chen, Jeff Setter, and José F. Martínez Computer Systems Lab Cornell University Page 1

More information

Resource Containers. A new facility for resource management in server systems. Presented by Uday Ananth. G. Banga, P. Druschel, J. C.

Resource Containers. A new facility for resource management in server systems. Presented by Uday Ananth. G. Banga, P. Druschel, J. C. Resource Containers A new facility for resource management in server systems G. Banga, P. Druschel, J. C. Mogul OSDI 1999 Presented by Uday Ananth Lessons in history.. Web servers have become predominantly

More information

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering

More information

ibench: Quantifying Interference in Datacenter Applications

ibench: Quantifying Interference in Datacenter Applications ibench: Quantifying Interference in Datacenter Applications Christina Delimitrou and Christos Kozyrakis Stanford University IISWC September 23 th 2013 Executive Summary Problem: Increasing utilization

More information

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe

More information

for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami

for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami 3D Implemented dsram/dram HbidC Hybrid Cache Architecture t for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami Kyushu University

More information

mos: An Architecture for Extreme Scale Operating Systems

mos: An Architecture for Extreme Scale Operating Systems mos: An Architecture for Extreme Scale Operating Systems Robert W. Wisniewski, Todd Inglett, Pardo Keppel, Ravi Murty, Rolf Riesen Presented by: Robert W. Wisniewski Chief Software Architect Extreme Scale

More information

Towards Energy-Proportional Datacenter Memory with Mobile DRAM

Towards Energy-Proportional Datacenter Memory with Mobile DRAM Towards Energy-Proportional Datacenter Memory with Mobile DRAM Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University

More information