A Framework for Providing Quality of Service in Chip Multi-Processors


1 A Framework for Providing Quality of Service in Chip Multi-Processors
Fei Guo (1), Yan Solihin (1), Li Zhao (2), Ravishankar Iyer (2)
(1) North Carolina State University
(2) Intel Corporation
The 40th Annual IEEE/ACM International Symposium on Microarchitecture

2 Background
- Chip Multi-Processor (CMP) is the mainstream architecture
  - Some platform resources (cache, bandwidth) are shared
  - Resource sharing leads to contention
  - Contention may result in large performance variation
- Future uses of CMP
  - Run diverse applications with diverse requirements
  - Require performance Quality of Service (QoS)

3 Related Work
- Previous QoS frameworks [Iyer04][Yet05][Hsu06][Rafique06][Nesbit07][Iyer07]
  - QoS target specified as IPC or miss rate
  - Resource partitioning (cache, off-chip bandwidth)
  - Resource manager: allocates resources to reach all applications' QoS targets
- Previous QoS frameworks do not fully provide QoS

4 Problems with Previous Frameworks
[Figure: IPC versus number of concurrent applications (bzip2) on a 4-core CMP, with the target IPC marked]
- QoS targets are not met when more than 2 jobs run simultaneously
- The CMP cannot check whether available resources are sufficient
- The CMP does not know when to reject jobs

5 Contributions
- A framework to provide QoS in a CMP
  - Appropriate QoS target: allows an admission control policy
  - QoS execution modes: important for flexibility and throughput
- Safe throughput optimization techniques that preserve QoS
  - QoS execution mode downgrade
  - Resource stealing

6 Outline
- QoS target specification
- QoS execution modes
- Resource stealing
- Evaluation
- Conclusions

7 QoS Targets for Individual Jobs
- Performance metrics
  - IPC or cache miss rate
- Resource Usage Metrics (RUM)
  - Cache size, bandwidth rate
  - Easily comparable
  - Foundation for constructing admission control
  - Cannot be ill-defined
  - More familiar to the users
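Because RUM targets are plain resource quantities, an admission test can reduce to a capacity check. The sketch below is illustrative only: the `Job` fields, the `admit` function, and the default capacities (4 cores, 16 cache ways) are assumptions made for this example, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical RUM-style QoS target: resources requested, not IPC."""
    name: str
    cores: int
    cache_ways: int


def admit(running: list[Job], new_job: Job,
          total_cores: int = 4, total_ways: int = 16) -> bool:
    """Accept new_job only if unreserved cores and cache ways cover its request."""
    used_cores = sum(j.cores for j in running)
    used_ways = sum(j.cache_ways for j in running)
    return (used_cores + new_job.cores <= total_cores
            and used_ways + new_job.cache_ways <= total_ways)


# Two running jobs already reserve 14 of the 16 ways (assumed numbers).
running = [Job("a", 1, 7), Job("b", 1, 7)]
print(admit(running, Job("c", 1, 7)))  # False: only 2 ways remain
print(admit(running, Job("d", 1, 2)))  # True: 2 cores and 2 ways are free
```

A target expressed as IPC or miss rate offers no comparable check, since the system cannot tell in advance whether the remaining resources are enough to reach it.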

8 Timeslot
[Figure: resource reservation over time t]
- Maximum wall-clock time
  - Borrowed from batch job systems
- Deadline: latest expected completion time
  - Soft deadline
- Timeslot specification is optional

9 QoS Execution Modes
Provide various strictness levels in meeting QoS targets (listed from strong to weak):
- Strict: rigid implied throughput and deadline requirements
  - Resources and timeslot must be strictly reserved
- Elastic(X): rigid deadline requirement
  - Can tolerate throughput deviation (X% max slowdown)
- Opportunistic: flexible throughput and deadline requirements

10-11 Mode Downgrade
- Manual mode downgrade
  - Requires users to change a job's mode to a weaker one
  - Users are fully aware of the consequences
- Automatic mode downgrade
  - Transparent to users
  - Deadlines are preserved
  - Throughput variation is tolerable by jobs
[Figure: timelines running from $t_a$ to $t_d$ for a job with wall-clock time $t_w$]
- Strict -> Opp: timeline marked at $t_d - t_w$
- Strict -> Elastic(X): $X = \frac{(t_d - t_a) - t_w}{t_w}$
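The slack behind automatic downgrade can be worked through numerically. This is a minimal sketch under stated assumptions: a job arrives at `t_a`, must finish by `t_d`, and needs at most `t_w` of wall-clock time in Strict mode. An Elastic(X) job is assumed to finish on time as long as `t_w * (1 + X)` fits between `t_a` and `t_d`, and `t_d - t_w` is read as the latest point at which an Opportunistic job could return to Strict mode. The function names are hypothetical.

```python
def elastic_slack(t_a: float, t_d: float, t_w: float) -> float:
    """Largest slowdown fraction X such that t_w * (1 + X) fits in [t_a, t_d]."""
    if t_w <= 0:
        raise ValueError("t_w must be positive")
    # No slack (or a deadline that is already too tight) yields X = 0.
    return max(0.0, ((t_d - t_a) - t_w) / t_w)


def latest_strict_start(t_d: float, t_w: float) -> float:
    """Latest time a job can resume Strict mode and still run t_w before t_d."""
    return t_d - t_w


# Assumed example: arrival at 0, deadline at 15, wall-clock time 10.
print(elastic_slack(t_a=0.0, t_d=15.0, t_w=10.0))  # 0.5
print(latest_strict_start(t_d=15.0, t_w=10.0))     # 5.0
```

With these numbers the job could tolerate up to a 50% slowdown as Elastic(X), or run opportunistically until time 5 before it needs its full reservation.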

12-14 Impact of Mode Downgrade
A 4-core CMP receives jobs; 6 jobs are illustrated.
[Figure: job schedule over time (0T to 3T) for Strict, Opp, and Elastic(X) modes; labels: max wall-clock time, deadline, 40% of cache]
- External resource fragmentation
  - 20% of the cache and 2 cores are unused
  - Insufficient resources to accept a new job
- Internal resource fragmentation
  - A job does not use all of its allocated resources

15 Cache Capacity Partitioning
- Based on a fine-grain per-set partition scheme [Iyer04][Nesbit07]
- A job specifies the number of cache ways
- Stealing cache capacity = stealing cache ways

16-17 Resource Stealing (RS) Overview
- In Elastic(X), X = maximum CPI increase
  - Must know CPI with and without RS
  - Only one of them is known at any given time
- Observation: CPI components are additive
  $CPI = CPI_{L2\infty} + h_m \cdot t_m$
  - $h_m$ = L2 misses per instruction
  - $t_m$ = average L2 miss latency
- $t_m$ can be kept constant
- Resource stealing only changes $h_m$
  - Monitored through duplicate tags that use the original partitions
- $h_m$ increase of X% => CPI increase < X%
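The claim that an X% rise in misses bounds the CPI rise below X% follows from the additive model: only the miss component grows, while the base component stays fixed. The sketch below checks this numerically. It assumes CPI is a base term (unaffected by L2 misses) plus misses per instruction times a constant miss latency. All numeric values, and the `may_keep_stealing` guard that compares observed misses against a count from duplicate tags, are hypothetical illustrations rather than the authors' mechanism.

```python
def cpi(cpi_base: float, h_m: float, t_m: float) -> float:
    """Additive CPI model: base component plus L2 miss stall component."""
    return cpi_base + h_m * t_m


def cpi_increase(cpi_base: float, h_m: float, t_m: float,
                 miss_increase: float) -> float:
    """Relative CPI increase when h_m grows by miss_increase and t_m is constant."""
    before = cpi(cpi_base, h_m, t_m)
    after = cpi(cpi_base, h_m * (1.0 + miss_increase), t_m)
    return (after - before) / before


def may_keep_stealing(misses_now: int, misses_original: int, x: float) -> bool:
    """Allow stealing while misses stay within (1 + x) of the original-partition count."""
    return misses_now <= misses_original * (1.0 + x)


# Assumed values: base CPI 1.0, 0.01 misses per instruction, 100-cycle miss latency.
inc = cpi_increase(cpi_base=1.0, h_m=0.01, t_m=100.0, miss_increase=0.05)
print(f"{inc:.4f}")  # stays below the 5% miss increase
print(may_keep_stealing(misses_now=1040, misses_original=1000, x=0.05))
```

Because the base term is positive, the relative CPI increase is always a fraction of the relative miss increase, which is why bounding the miss count is a conservative stand-in for bounding CPI.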

18 Evaluation Methodology
- Simulation environment (based on Simics)
  - 4-core CMP running Fedora Core 4 Linux
  - 2MB 16-way shared L2 cache
- Applications
  - Selected from 15 SPEC2006 C/C++ benchmarks
  - bzip2 (highly cache sensitive)
  - hmmer (moderately cache sensitive)
  - gobmk (not cache sensitive)

19 Evaluation Methodology
- Workload composition
  - Workload 1: ten identical jobs
  - Workload 2: ten mixed jobs
  - Tight deadline (5 jobs), moderate deadline (3 jobs), and relaxed deadline (2 jobs)
- Execution mode configurations
  - All-Strict (Base)
  - Hybrid: 4 Strict + 3 Elastic(5%) + 3 Opportunistic jobs
  - All-Strict+AutoDown: 10 Strict jobs -> Opportunistic
  - EqualPart: cache equally partitioned among cores; no admission control

20 Impact of Different Modes
[Figure: fraction of Strict/Elastic(X) jobs that meet deadlines (0% to 100%) for gobmk, hmmer, and bzip2 under All-Strict, Hybrid, All-Strict+AutoDown, and EqualPart]
- EqualPart: most jobs miss deadlines
- Our schemes: all Strict/Elastic(X) jobs meet deadlines

21 Impact of Different Modes
[Figure: overall job throughput (normalized) for gobmk, hmmer, and bzip2 under All-Strict, Hybrid, All-Strict+AutoDown, and EqualPart]
- Strong QoS vs. throughput trade-offs
- Execution mode variety boosts throughput
- Auto mode downgrade transparently boosts throughput
- Moderate/relaxed deadlines help

22 Resource Stealing
Impact of performance slack X in the Hybrid case (bzip2)
[Figure: average CPI or miss rate increase of Elastic(X) jobs (0% to 25%) versus performance slack X (5% to 20%); series: MissRate, Bound, CPI]
- Using duplicate tags is effective
- CPI increase < miss rate increase
  - Miss rate is a safe proxy

23 Mixed-Benchmark Workloads

        Strict   Elastic(5%)   Opportunistic
Mix-1   hmmer    gobmk         bzip2
Mix-2   hmmer    bzip2         gobmk

- Mix-1 is favorable for resource stealing
  - bzip2 (cache sensitive) is the recipient
  - gobmk (cache insensitive) is the donor
- Mix-2 is not favorable for resource stealing

24 Mixed-Benchmark Workloads
[Figure: overall throughput (normalized) of Mix-1 and Mix-2 under All-Strict, Hybrid, All-Strict+AutoDown, and EqualPart]
- In Hybrid, Mix-1 outperforms Mix-2
- Mix-1 can boost throughput by up to 46%

25 Conclusions
- Appropriate QoS target? Resource Usage Metrics (RUM)
  - Allows an admission control policy
- Strong QoS vs. throughput trade-offs
- Throughput can be safely boosted
  - Significantly (13-46%)
  - Through execution mode downgrade
    - Manually
    - Automatically (transparent to users)
- Resource stealing is effective

26 Thank You!
Presenter: Fei Guo

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

Improving Real-Time Performance on Multicore Platforms Using MemGuard

Improving Real-Time Performance on Multicore Platforms Using MemGuard Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study

More information

Towards Energy Proportionality for Large-Scale Latency-Critical Workloads

Towards Energy Proportionality for Large-Scale Latency-Critical Workloads Towards Energy Proportionality for Large-Scale Latency-Critical Workloads David Lo *, Liqun Cheng *, Rama Govindaraju *, Luiz André Barroso *, Christos Kozyrakis Stanford University * Google Inc. 2012

More information

QoS support for Intelligent Storage Devices

QoS support for Intelligent Storage Devices QoS support for Intelligent Storage Devices Joel Wu Scott Brandt Department of Computer Science University of California Santa Cruz ISW 04 UC Santa Cruz Mixed-Workload Requirement General purpose systems

More information

Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore

Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore By Dan Stafford Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore Design Space Results & Observations General

More information

SWAP: EFFECTIVE FINE-GRAIN MANAGEMENT

SWAP: EFFECTIVE FINE-GRAIN MANAGEMENT : EFFECTIVE FINE-GRAIN MANAGEMENT OF SHARED LAST-LEVEL CACHES WITH MINIMUM HARDWARE SUPPORT Xiaodong Wang, Shuang Chen, Jeff Setter, and José F. Martínez Computer Systems Lab Cornell University Page 1

More information

An Analytical Model for Optimum Off- Chip Memory Bandwidth Partitioning in Multi-core Architectures

An Analytical Model for Optimum Off- Chip Memory Bandwidth Partitioning in Multi-core Architectures Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Virtual Machine Scheduling in NUMA Multicore Systems Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore

More information

Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures

Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim,Christian Engelmann, and Galen Shipman

More information

Deterministic Memory Abstraction and Supporting Multicore System Architecture

Deterministic Memory Abstraction and Supporting Multicore System Architecture Deterministic Memory Abstraction and Supporting Multicore System Architecture Farzad Farshchi $, Prathap Kumar Valsan^, Renato Mancuso *, Heechul Yun $ $ University of Kansas, ^ Intel, * Boston University

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality-of-Service

Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality-of-Service Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality-of-Service * Kshitij Sudan* Sadagopan Srinivasan Rajeev Balasubramonian* Ravi Iyer Executive Summary Goal: Co-schedule N applications

More information

PYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads

PYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads PYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads Ran Xu (Purdue), Subrata Mitra (Adobe Research), Jason Rahman (Facebook), Peter Bai (Purdue),

More information

Insights on the performance and configuration of AVB and TSN in automotive applications

Insights on the performance and configuration of AVB and TSN in automotive applications Insights on the performance and configuration of AVB and TSN in automotive applications Nicolas NAVET, University of Luxembourg Josetxo VILLANUEVA, Groupe Renault Jörn MIGGE, RealTime-at-Work (RTaW) Marc

More information

Chip-Multithreading Systems Need A New Operating Systems Scheduler

Chip-Multithreading Systems Need A New Operating Systems Scheduler Chip-Multithreading Systems Need A New Operating Systems Scheduler Alexandra Fedorova Christopher Small Daniel Nussbaum Margo Seltzer Harvard University, Sun Microsystems Sun Microsystems Sun Microsystems

More information

Are You Insured Against Your Noisy Neighbor Sunku Ranganath, Intel Corporation Sridhar Rao, Spirent Communications

Are You Insured Against Your Noisy Neighbor Sunku Ranganath, Intel Corporation Sridhar Rao, Spirent Communications Are You Insured Against Your Noisy Neighbor Sunku Ranganath, Intel Corporation Sridhar Rao, Spirent Communications @SunkuRanganath, @ngignir Legal Disclaimer 2018 Intel Corporation. Intel, the Intel logo,

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Belay, A. et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Reviewed by Chun-Yu and Xinghao Li Summary In this

More information

FlexSC. Flexible System Call Scheduling with Exception-Less System Calls. Livio Soares and Michael Stumm. University of Toronto

FlexSC. Flexible System Call Scheduling with Exception-Less System Calls. Livio Soares and Michael Stumm. University of Toronto FlexSC Flexible System Call Scheduling with Exception-Less System Calls Livio Soares and Michael Stumm University of Toronto Motivation The synchronous system call interface is a legacy from the single

More information

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CSAIL IAP MEETING MAY 21, 2013 Research Agenda Lack of technology progress Moore s Law still alive Power

More information

Optimizing Replication, Communication, and Capacity Allocation in CMPs

Optimizing Replication, Communication, and Capacity Allocation in CMPs Optimizing Replication, Communication, and Capacity Allocation in CMPs Zeshan Chishti, Michael D Powell, and T. N. Vijaykumar School of ECE Purdue University Motivation CMP becoming increasingly important

More information

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Evaluation Metrics, Simulation, and Workloads

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Evaluation Metrics, Simulation, and Workloads Advanced Computer Architecture II (Parallel Computer Architecture) Evaluation Metrics, Simulation, and Workloads Copyright 2010 Daniel J. Sorin Duke University Outline Metrics Methodologies Modeling Simulation

More information

TDDD82 Secure Mobile Systems Lecture 6: Quality of Service

TDDD82 Secure Mobile Systems Lecture 6: Quality of Service TDDD82 Secure Mobile Systems Lecture 6: Quality of Service Mikael Asplund Real-time Systems Laboratory Department of Computer and Information Science Linköping University Based on slides by Simin Nadjm-Tehrani

More information

Accelerate Applications Using EqualLogic Arrays with directcache

Accelerate Applications Using EqualLogic Arrays with directcache Accelerate Applications Using EqualLogic Arrays with directcache Abstract This paper demonstrates how combining Fusion iomemory products with directcache software in host servers significantly improves

More information

Hybrid Cache Architecture (HCA) with Disparate Memory Technologies

Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ram Rajamony, Yuan Xie Pennsylvania State University IBM Austin Research Laboratory Acknowledgement:

More information

An Analytical Performance Model for Co-Management of Last-Level Cache and Bandwidth Sharing

An Analytical Performance Model for Co-Management of Last-Level Cache and Bandwidth Sharing An Analytical Performance Model for Co-Management of Last-Level Cache and Bandwidth Sharing Taecheol Oh, Kiyeon Lee, and Sangyeun Cho Computer Science Department, University of Pittsburgh Pittsburgh, PA

More information

ibench: Quantifying Interference in Datacenter Applications

ibench: Quantifying Interference in Datacenter Applications ibench: Quantifying Interference in Datacenter Applications Christina Delimitrou and Christos Kozyrakis Stanford University IISWC September 23 th 2013 Executive Summary Problem: Increasing utilization

More information

CSE 548 Computer Architecture. Clock Rate vs IPC. V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger. Presented by: Ning Chen

CSE 548 Computer Architecture. Clock Rate vs IPC. V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger. Presented by: Ning Chen CSE 548 Computer Architecture Clock Rate vs IPC V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger Presented by: Ning Chen Transistor Changes Development of silicon fabrication technology caused transistor

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions

More information

Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services. Presented by: Jitong Chen

Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services. Presented by: Jitong Chen Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services Presented by: Jitong Chen Outline Architecture of Web-based Data Center Three-Stage framework to benefit

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Architecting HBM as a High Bandwidth, High Capacity, Self-Managed Last-Level Cache

Architecting HBM as a High Bandwidth, High Capacity, Self-Managed Last-Level Cache Architecting HBM as a High Bandwidth, High Capacity, Self-Managed Last-Level Cache Tyler Stocksdale Advisor: Frank Mueller Mentor: Mu-Tien Chang Manager: Hongzhong Zheng 11/13/2017 Background Commodity

More information

Emerging NVM Memory Technologies

Emerging NVM Memory Technologies Emerging NVM Memory Technologies Yuan Xie Associate Professor The Pennsylvania State University Department of Computer Science & Engineering www.cse.psu.edu/~yuanxie yuanxie@cse.psu.edu Position Statement

More information

vcache: Architectural Support for Transparent and Isolated Virtual LLCs in Virtualized Environments

vcache: Architectural Support for Transparent and Isolated Virtual LLCs in Virtualized Environments vcache: Architectural Support for Transparent and Isolated Virtual LLCs in Virtualized Environments Daehoon Kim *, Hwanju Kim, Nam Sung Kim *, and Jaehyuk Huh * University of Illinois at Urbana-Champaign,

More information

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,

More information

The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!!

The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!! The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!! 1 2 3 Modern CMPs" Intel e5 2600 (2013)! SLLC" AMD Orochi (2012)! SLLC"

More information

A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach

A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach Asit K. Mishra Onur Mutlu Chita R. Das Executive summary Problem: Current day NoC designs are agnostic to application requirements

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Virtualized ECC: Flexible Reliability in Memory Systems

Virtualized ECC: Flexible Reliability in Memory Systems Virtualized ECC: Flexible Reliability in Memory Systems Doe Hyun Yoon Advisor: Mattan Erez Electrical and Computer Engineering The University of Texas at Austin Motivation Reliability concerns are growing

More information

Advanced Computer Architecture (CS620)

Advanced Computer Architecture (CS620) Advanced Computer Architecture (CS620) Background: Good understanding of computer organization (eg.cs220), basic computer architecture (eg.cs221) and knowledge of probability, statistics and modeling (eg.cs433).

More information

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems 1 Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems Dimitris Kaseridis, Member, IEEE, Muhammad Faisal Iqbal, Student Member, IEEE and Lizy Kurian John,

More information

QoS Policies and Architecture for Cache/Memory in CMP Platforms

QoS Policies and Architecture for Cache/Memory in CMP Platforms QoS Policies and Architecture for Cache/Memory in CMP Platforms Ravi Iyer, Li Zhao, Fei Guo 2, Ramesh Illikkal, Srihari Makineni, Don Newell, Yan Solihin 2, Lisa Hsu 3, Steve Reinhardt 3 Intel Corporation

More information

Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW

Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW 2 SO FAR talked about in-kernel building blocks: threads memory IPC drivers

More information

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 1 School of

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

Symphony: An Integrated Multimedia File System

Symphony: An Integrated Multimedia File System Symphony: An Integrated Multimedia File System Prashant J. Shenoy, Pawan Goyal, Sriram S. Rao, and Harrick M. Vin Distributed Multimedia Computing Laboratory Department of Computer Sciences, University

More information

Exploiting Core Criticality for Enhanced GPU Performance

Exploiting Core Criticality for Enhanced GPU Performance Exploiting Core Criticality for Enhanced GPU Performance Adwait Jog, Onur Kayıran, Ashutosh Pattnaik, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, Chita R. Das. SIGMETRICS 16 Era of Throughput Architectures

More information

Virtual Private Caches

Virtual Private Caches Kyle J. Nesbit, James Laudon *, and James E. Smith University of Wisconsin Madison Departmnet of Electrical and Computer Engr. { nesbit, jes }@ece.wisc.edu Sun Microsystems, Inc. * James.laudon@sun.com

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

A Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems

A Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems Dimitris Kaseridis, Jeffrey Stuecheli, Jian Chen and Lizy K. John Department of Electrical

More information

Virtual Private Machines: A Resource Abstraction for Multi-Core Computer Systems

Virtual Private Machines: A Resource Abstraction for Multi-Core Computer Systems Virtual Private Machines: A Resource Abstraction for Multi-Core Computer Systems Kyle J. Nesbit University of Wisconsin Madison Department of Electrical and Computer Engineering kjnesbit@ece.wisc.edu James

More information

Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip

Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip ASP-DAC 2010 20 Jan 2010 Session 6C Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip Jonas Diemer, Rolf Ernst TU Braunschweig, Germany diemer@ida.ing.tu-bs.de Michael Kauschke Intel,

More information

Response Time and Throughput

Response Time and Throughput Response Time and Throughput Response time How long it takes to do a task Throughput Total work done per unit time e.g., tasks/transactions/ per hour How are response time and throughput affected by Replacing

More information

for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami

for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami 3D Implemented dsram/dram HbidC Hybrid Cache Architecture t for High Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami Kyushu University

More information

Managing GPU Concurrency in Heterogeneous Architectures

Managing GPU Concurrency in Heterogeneous Architectures Managing Concurrency in Heterogeneous Architectures Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das Era of Heterogeneous Architectures

More information

Memory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez

Memory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez Memory Mapped ECC Low-Cost Error Protection for Last Level Caches Doe Hyun Yoon Mattan Erez 1-Slide Summary Reliability issues in caches Increasing soft error rate (SER) Cost increases with error protection

More information

Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment

Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment Xin-Wei Shih, Tzu-Hsuan Hsu, Hsu-Chieh Lee, Yao-Wen Chang, Kai-Yuan Chao 2013.01.24 1 Outline 2 Clock Network Synthesis Clock network

More information

The bottom line: Performance. Measuring and Discussing Computer System Performance. Our definition of Performance. How to measure Execution Time?

The bottom line: Performance. Measuring and Discussing Computer System Performance. Our definition of Performance. How to measure Execution Time? The bottom line: Performance Car to Bay Area Speed Passengers Throughput (pmph) Ferrari 3.1 hours 160 mph 2 320 Measuring and Discussing Computer System Performance Greyhound 7.7 hours 65 mph 60 3900 or

Towards Energy-Proportional Datacenter Memory with Mobile DRAM

Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University

Hyperthreading Technology

Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

MARACAS: A Real-Time Multicore VCPU Scheduling Framework

Computer Science Department Boston University Overview 1 2 3 4 5 6 7 Motivation platforms are gaining popularity in embedded and real-time systems concurrent workload support less

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency

Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr

A Brief Compendium of On Chip Memory Highlighting the Tradeoffs Implementing SRAM,

A Brief Compendium of On Chip Memory Highlighting the Tradeoffs Implementing, RAM, or edram Justin Bates Department of Electrical and Computer Engineering University of Central Florida Orlando, FL 3816-36

Course web site: teaching/courses/car. Piazza discussion forum:

Announcements Course web site: http://www.inf.ed.ac.uk/teaching/courses/car Lecture slides Tutorial problems Courseworks Piazza discussion forum: http://piazza.com/ed.ac.uk/spring2018/car Tutorials start

Outline. Emerging Trends. An Integrated Hardware/Software Approach to On-Line Power-Performance Optimization. Conventional Processor Design

Outline An Integrated Hardware/Software Approach to On-Line Power-Performance Optimization Sandhya Dwarkadas University of Rochester Framework: Dynamically Tunable Clustered Multithreaded Architecture

CA485 Ray Walshe Google File System

Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

CHOP: Integrating DRAM Caches for CMP Server Platforms. Uliana Navrotska

Uliana Navrotska 766785 What I will talk about? Problem: in the era of many-core architecture at the heart of the trouble is the so-called memory

Thesis Defense Lavanya Subramanian

Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Committee: Advisor: Onur Mutlu Greg Ganger James Hoe Ravi Iyer (Intel)

PROBABILISTIC SCHEDULING MICHAEL ROITZSCH

Faculty of Computer Science Institute of Systems Architecture, Operating Systems Group DESKTOP REAL-TIME 2 PROBLEM worst case execution time (WCET) largely exceeds

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14)

Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers

Stavros Volos, Ciprian Seiculescu, Boris Grot, Naser Khosro Pour, Babak Falsafi, and Giovanni De Micheli Toward

No compromises: distributed transactions with consistency, availability, and performance

Aleksandar Dragojević, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam,

BP-NUCA: CACHE PRESSURE-AWARE MIGRATION FOR HIGH-PERFORMANCE CACHING IN CMPS

Computing and Informatics, Vol. 3, 211, 137 16 Xiaomin Jia, Jiang Jiang, Yongwen Wang, Shubo Qi Tianlei Zhao, Guitao Fu, Minxuan

Modification and Evaluation of Linux I/O Schedulers

1 Asad Naweed, Joe Di Natale, and Sarah J Andrabi University of North Carolina at Chapel Hill Abstract In this paper we present three different Linux

Improving Cache Performance using Victim Tag Stores

SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

Workloads, Scalability and QoS Considerations in CMP Platforms

Presenter Don Newell Sr. Principal Engineer Intel Corporation 2007 Intel Corporation Agenda Trends and research context Evolving Workload

An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors

Jiacheng Zhao Institute of Computing Technology, CAS In Conjunction with Prof. Jingling Xue, UNSW, Australia

Bank-aware Dynamic Cache Partitioning for Multicore Architectures

Dimitris Kaseridis, Jeffrey Stuecheli and Lizy K. John Department of Electrical and Computer Engineering, The University of Texas at Austin,

3D Memory Architecture. Kyushu University

3D Memory Architecture Koji Inoue Kyushu University 1 Outline Why 3D? Will 3D always work well? Support Adaptive Execution! Memory Hierarchy Run time Optimization Conclusions 2 Outline Why 3D? Will 3D

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Instruction-Level Parallelism (ILP): What we've seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

CMSC 22200 Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff Lab 4 Due: 11:49pm, Saturday Two late days with penalty Exam I Grades out on

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors

July 4-6, 2018, London, U.K. Lei Wang, Fen Ge, Hao Lu, Ning Wu, Ying Zhang, and Fang Zhou Abstract As

Incorporating DMA into QoS Policies for Maximum Performance in Shared Memory Systems. Scott Marshall and Stephen Twigg

2 Problems with Shared Memory I/O Fairness Memory bandwidth worthless without memory

Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems

1 Presented by Hadeel Alabandi Introduction and Motivation 2 A serious issue to the effective utilization

8: Scheduling. Scheduling. Mark Handley

Scheduling On a multiprocessing system, more than one process may be available to run. The task of deciding which process to run next is called scheduling, and is performed by

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter

Motivation Memory is a shared resource Core Core Core Core

Bias Scheduling in Heterogeneous Multi-core Architectures

David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that

SOFT CONTAINER TOWARDS 100% RESOURCE UTILIZATION ACCELA ZHAO, LAYNE PENG

1 WHO ARE THOSE GUYS Accela Zhao, Technologist at EMC OCTO, active Openstack community contributor, experienced in cloud scheduling

15-740/18-740 Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

Stash Directory: A Scalable Directory for Many-Core Coherence. Socrates Demetriades and Sangyeun Cho

20th International Symposium on High Performance Computer Architecture (HPCA). Orlando, FL, February

Design Considerations for the Symphony Integrated Multimedia File System

Prashant Shenoy Pawan Goyal Sriram Rao Harrick M. Vin Department of Computer Science, IBM Research Division Department of Computer

02 - Distributed Systems

Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/58 Definition Distributed Systems Distributed System is

[This is not an article, chapter, of conference paper!]

http://www.diva-portal.org Performance Comparison between Scaling of Virtual Machines and Containers using Cassandra NoSQL Database Sogand Shirinbab,

PAC485 Managing Datacenter Resources Using the VirtualCenter Distributed Resource Scheduler

Carl Waldspurger Principal Engineer, R&D This presentation may contain VMware confidential information. Copyright
