Variability in Architectural Simulations of Multi-threaded Workloads

1 Variability in Architectural Simulations of Multi-threaded Workloads. Alaa R. Alameldeen and David A. Wood, University of Wisconsin-Madison

2 Motivation. Experimental scientists use statistics; computer architects running simulation experiments don't! Why ignore statistics? Because simulations are deterministic. This can lead to wrong conclusions!

3 Workload Variability. [Figure: OLTP, cycles per transaction (millions) versus DRAM access latency (ns).]

4 Workload Variability. [Figure: OLTP, cycles per transaction (millions) versus DRAM access latency (ns).] Apparent conclusion: slower memory is better!

5 What Went Wrong? There are many possible executions for each configuration. Why? Different timing effects: OS scheduling decisions, different orders of lock acquisition, and different transaction mixes. This is magnified by short simulations. Variability can lead to wrong conclusions.

6 Overview. Variability is a real phenomenon for multi-threaded workloads: runs from the same initial state can differ. Variability is a challenge for simulations because simulations are short. Our solution accounts for variability using multiple runs and statistical techniques.

7 Outline: Motivation and Overview; Variability in Real Systems (Time and Space Variability); Variability in Simulations; Accounting for Variability; Conclusions.

8 What is Variability? Differences between multiple estimates of a workload's performance. Time variability: performance changes during different phases of a single run. Space variability: runs starting from the same state follow different execution paths.

9 Time Variability in Real Systems. [Figure: OLTP, cycles per transaction (millions) versus time (sec), one-second intervals.]

10 Time Variability Example (Cont'd). How is this handled in real experiments? Solution: run your experiment long enough! [Figure: OLTP, cycles per transaction (millions) versus time (sec), one-minute intervals.]

11 Space Variability in Real Systems. [Figure: OLTP, cycles per transaction (millions) versus time (sec), one-second averages, 5 runs.]

12 Space Variability Example (Cont'd). How is this handled in real experiments? Same solution: run your experiment long enough! [Figure: OLTP, cycles per transaction (millions) versus time (sec), one-minute averages, 5 runs; annotation: 16-day simulation.]

13 Outline: Motivation and Overview; Variability in Real Systems; Variability in Simulations (Simulation Infrastructure, Injecting Randomness, The Wrong Conclusion Ratio); Accounting for Variability; Conclusions.

14 Simulation Infrastructure. Workloads: two scientific and five commercial benchmarks. Target system: E10000-like 16-node system. Full-system simulation: Virtutech Simics running Solaris 8 on SPARC V9. Processor models: a blocking processor model (Simics) and an out-of-order (OoO) processor model (TFSim; Mauer et al., SIGMETRICS '02). Memory system simulator: MOSI invalidation-based broadcast coherence protocol (Martin et al., HPCA-02).

15 Simulating Space Variability? Simulations are deterministic, but variability cannot be ignored for multi-threaded applications. One execution may not be representative, and execution paths affect simulation conclusions. We need to obtain a space of results.

16 Injecting Randomness. We introduce artificial random perturbations in each simulation run. For each memory access, the latency in nanoseconds becomes Latency + r, where r is drawn uniformly from {-2, -1, 0, 1, 2} ns. This roughly models contention due to DMA traffic. Other methods are possible.
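The perturbation rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 80 ns base latency used in the usage note, and the idea of summing latencies as a stand-in for a "run" are hypothetical and are not taken from the authors' simulator.

```python
import random

# Perturbation values from the slide: r in {-2, -1, 0, 1, 2} ns, uniform.
PERTURBATIONS_NS = (-2, -1, 0, 1, 2)


def perturbed_latency_ns(base_latency_ns, rng):
    """Return the base memory latency plus a uniformly chosen perturbation r."""
    return base_latency_ns + rng.choice(PERTURBATIONS_NS)


def simulate_run(base_latency_ns, num_accesses, seed):
    """Hypothetical stand-in for one simulation run.

    Sums the perturbed latency of every memory access. A real simulator
    would feed each perturbed latency into its timing model instead.
    """
    rng = random.Random(seed)  # a distinct seed per run yields a distinct execution
    return sum(perturbed_latency_ns(base_latency_ns, rng)
               for _ in range(num_accesses))
```

Calling `simulate_run(80, 1000, seed)` with 20 different seeds would give 20 slightly different results from the same starting state, which is the "space of results" the previous slide asks for. Reusing a seed reproduces a run exactly.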

17 Simulated Space Variability. [Figure: normalized runtime (max, avg, min; 20 runs, ~10 hrs sim.) for Barnes-Hut, Ocean, ECPerf, Slashcode, OLTP, Apache, and SPECjbb.] Space variability exists in our benchmarks.

18 Quantifying Variability: The Wrong Conclusion Ratio (WCR). [Figure: OLTP, cycles per transaction (millions) versus ROB size; max, avg, min over 20 runs of 50 transactions.] WCR(16,32) = 18%, WCR(16,64) = 7.5%, WCR(32,64) = 26%.

19 Outline: Motivation and Overview; Variability in Real Systems; Variability in Simulations; Accounting for Variability; Conclusions.

20 Definition: Confidence Intervals. A confidence interval is a range of values expected to include a population parameter (e.g. the mean). Confidence probability: the probability that the true mean lies inside the confidence interval. For the same confidence probability, a larger sample size gives a narrower confidence interval.

21 Accounting for Space Variability. [Figure: OLTP, cycles per transaction (millions) versus sample size (number of runs).]

22 Accounting for Space Variability. [Figure: OLTP, cycles per transaction (millions) versus sample size (number of runs).] Simple solution: estimate the number of runs such that the confidence intervals do not overlap. Tests of hypotheses can also be used (see the paper).
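One way to read the "simple solution" is as a loop that grows the sample until the two confidence intervals separate. The sketch below assumes a 95% confidence probability and uses a small table of standard two-sided Student's t critical values; the function names and the choice to take the first n runs of each configuration are illustrative assumptions, not the authors' procedure.

```python
import math
import statistics

# Standard two-sided 95% Student's t critical values, keyed by
# degrees of freedom (n - 1). The 95% level is an assumption of this sketch.
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447,
        7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228, 11: 2.201, 12: 2.179,
        13: 2.160, 14: 2.145, 15: 2.131, 16: 2.120, 17: 2.110,
        18: 2.101, 19: 2.093}


def confidence_interval(samples):
    """95% confidence interval (low, high) for the mean of the samples."""
    n = len(samples)
    mean = statistics.mean(samples)
    half_width = T_95[n - 1] * statistics.stdev(samples) / math.sqrt(n)
    return (mean - half_width, mean + half_width)


def runs_needed(runs_a, runs_b):
    """Smallest n for which the intervals built from the first n runs of
    each configuration do not overlap, or None if they never separate
    within the available runs (capped by the size of the t table)."""
    max_n = min(len(runs_a), len(runs_b), max(T_95) + 1)
    for n in range(2, max_n + 1):
        lo_a, hi_a = confidence_interval(runs_a[:n])
        lo_b, hi_b = confidence_interval(runs_b[:n])
        if hi_a < lo_b or hi_b < lo_a:
            return n
    return None
```

A return value of `None` means the available runs cannot distinguish the two configurations at this confidence level, so either more runs are needed or the difference may not be real.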

23 Conclusions. Short runs of multi-threaded workloads exhibit variability. Variability can lead to wrong simulation conclusions. Our solution: inject randomness, perform multiple runs, and apply statistical techniques.

24 Backup Slides

25 Effects of OS Scheduling. [Figure: same threads versus different threads; recoverable labels: L2 set size, cycles.]

26 WCR Definition. The WCR is the percentage of comparison simulation experiments that reach a wrong conclusion. The correct conclusion is the relationship between the averages of the two populations. The WCR can be used to estimate the probability of a wrong conclusion for single experiments.
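The definition can be illustrated with a small Python sketch. It assumes that a "comparison experiment" pairs one run of configuration A with one run of configuration B and that all such pairs are enumerated; the slide does not say how pairs are formed, so that choice and the function name are hypothetical.

```python
import statistics


def wrong_conclusion_ratio(runs_a, runs_b):
    """Percentage of single-run comparisons that disagree with the means.

    The correct conclusion is taken to be the ordering of the two sample
    means. Every (a, b) pair whose ordering differs from it counts as a
    wrong conclusion. Ties between individual runs count as wrong when
    the mean of A is lower than the mean of B.
    """
    a_is_lower = statistics.mean(runs_a) < statistics.mean(runs_b)
    wrong = sum(1 for a in runs_a for b in runs_b if (a < b) != a_is_lower)
    return 100.0 * wrong / (len(runs_a) * len(runs_b))
```

For example, with runs `[1, 2, 3]` and `[2.5, 3.5, 4.5]` only the pair (3, 2.5) contradicts the means, so one of nine comparisons is wrong.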

27 Confidence Intervals - Equations. The confidence interval for the mean of a normally distributed infinite population:

$$\bar{y} - \frac{t s}{\sqrt{n}} \le \mu \le \bar{y} + \frac{t s}{\sqrt{n}}$$

Sample size needed to limit the relative error of the mean to r:

$$n = \left( \frac{t s}{r \bar{y}} \right)^2$$
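Both equations are easy to evaluate numerically. The sketch below assumes the usual meaning of the symbols: the sample mean, s the sample standard deviation, n the number of runs, and t the Student's t critical value for the chosen confidence probability, which the caller looks up (for example, 2.093 for 95% and n = 20). The function names are illustrative.

```python
import math
import statistics


def confidence_interval(samples, t):
    """Return (low, high) = mean -/+ t * s / sqrt(n) for the given samples."""
    n = len(samples)
    y_bar = statistics.mean(samples)
    half_width = t * statistics.stdev(samples) / math.sqrt(n)
    return (y_bar - half_width, y_bar + half_width)


def sample_size_for_relative_error(samples, t, r):
    """Return n = (t * s / (r * mean))**2, rounded up to a whole run count.

    The mean and s are estimated from a pilot set of runs; r is the target
    relative error of the mean (e.g. 0.05 for 5%).
    """
    y_bar = statistics.mean(samples)
    s = statistics.stdev(samples)
    return math.ceil((t * s / (r * y_bar)) ** 2)
```

Because t itself depends on n, a practical use would recompute the estimate with an updated t once the first answer is known.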

28 Hypothesis Testing. Tests whether there is no difference between two population means. The hypothesis µ_32 = µ_64 states that the means of the 32-entry and 64-entry ROB configurations are equal. The hypothesis is tested using sample means and variances. If the hypothesis is rejected, our conclusion is significant.
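A test of this kind can be sketched from sample means and variances alone. The slide does not name a specific test; the example below uses Welch's unequal-variance t statistic as one reasonable choice, and leaves the critical value to the caller. The function names are hypothetical.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples.

    Uses Welch's formulation, which does not assume equal variances.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = statistics.variance(sample_a) / n_a  # squared standard error of A
    se2_b = statistics.variance(sample_b) / n_b  # squared standard error of B
    t_stat = ((statistics.mean(sample_a) - statistics.mean(sample_b))
              / math.sqrt(se2_a + se2_b))
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


def reject_equal_means(sample_a, sample_b, t_critical):
    """True if |t| exceeds the caller-supplied critical value."""
    t_stat, _ = welch_t(sample_a, sample_b)
    return abs(t_stat) > t_critical
```

Rejecting the hypothesis of equal means is what lets a comparison between, say, the 32-entry and 64-entry ROB results be called significant.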

29 Accounting for Time Variability. Is time variability caused by the same effects that cause space variability? Use Analysis of Variance (ANOVA) to find out. If time variability is caused by different effects, we need to obtain a time sample: observations obtained from different starting points.
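As a rough illustration of the ANOVA step, the sketch below computes a one-way F statistic. It assumes each group holds the runs taken from one starting point, so that between-group variation reflects time variability and within-group variation reflects space variability. That grouping and the function name are assumptions; the slide only says to use ANOVA.

```python
import statistics


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of sample groups.

    F = (between-group sum of squares / (k - 1))
        / (within-group sum of squares / (N - k))
    where k is the number of groups and N the total number of observations.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))
```

A large F relative to the critical value of the F distribution suggests that the starting point matters beyond ordinary run-to-run variation, which is the case where a time sample is needed.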

30 Multi-threaded Workloads and Simulation. Multi-threaded workloads are important: they are the workloads for commercial servers, and new architectures support multi-threading. Performance metrics are different from traditional benchmarks: they are throughput-oriented (transactions), and IPC is not appropriate (idle time!). Simulation challenge: comparing systems running multi-threaded applications.

31 Simulation of Multi-threaded Workloads. Simulation is slow! We cannot simulate the whole workload. Solution: run for a fixed number of transactions, measure the per-transaction runtime (cycles per transaction), and use it to compare different systems.

Bandwidth Adaptive Snooping

Bandwidth Adaptive Snooping Two classes of multiprocessors Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet Project Computer Sciences Department University of Wisconsin

More information

Full-System Timing-First Simulation

Full-System Timing-First Simulation Full-System Timing-First Simulation Carl J. Mauer Mark D. Hill and David A. Wood Computer Sciences Department University of Wisconsin Madison The Problem Design of future computer systems uses simulation

More information

Evaluating Non-deterministic Multi-threaded Commercial Workloads

Evaluating Non-deterministic Multi-threaded Commercial Workloads Appears in the proceedings of the Computer Architecture Evaluation using Commercial Workloads (CAECW-2) February 2, 22 Evaluating Non-deterministic Multi-threaded Commercial Workloads Alaa R. Alameldeen,

More information

Token Coherence. Milo M. K. Martin Dissertation Defense

Token Coherence. Milo M. K. Martin Dissertation Defense Token Coherence Milo M. K. Martin Dissertation Defense Wisconsin Multifacet Project http://www.cs.wisc.edu/multifacet/ University of Wisconsin Madison (C) 2003 Milo Martin Overview Technology and software

More information

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Evaluation Metrics, Simulation, and Workloads

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Evaluation Metrics, Simulation, and Workloads Advanced Computer Architecture II (Parallel Computer Architecture) Evaluation Metrics, Simulation, and Workloads Copyright 2010 Daniel J. Sorin Duke University Outline Metrics Methodologies Modeling Simulation

More information

LogTM: Log-Based Transactional Memory

LogTM: Log-Based Transactional Memory LogTM: Log-Based Transactional Memory Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, & David A. Wood 12th International Symposium on High Performance Computer Architecture () 26 Mulitfacet

More information

This Unit. CIS 501 Computer Architecture. As You Get Settled. Readings. Metrics Latency and throughput. Reporting performance

This Unit. CIS 501 Computer Architecture. As You Get Settled. Readings. Metrics Latency and throughput. Reporting performance This Unit CIS 501 Computer Architecture Metrics Latency and throughput Reporting performance Benchmarking and averaging Unit 2: Performance Performance analysis & pitfalls Slides developed by Milo Martin

More information

Real Time: Understanding the Trade-offs Between Determinism and Throughput

Real Time: Understanding the Trade-offs Between Determinism and Throughput Real Time: Understanding the Trade-offs Between Determinism and Throughput Roland Westrelin, Java Real-Time Engineering, Brian Doherty, Java Performance Engineering, Sun Microsystems, Inc TS-5609 Learn

More information

OS Support for Virtualizing Hardware Transactional Memory

OS Support for Virtualizing Hardware Transactional Memory OS Support for Virtualizing Hardware Transactional Memory Michael M. Swift, Haris Volos, Luke Yen, Neelam Goyal, Mark D. Hill and David A. Wood University of Wisconsin Madison The Virtualization Problem

More information

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University

More information

Design Exploration of an Instruction-Based Shared Markov Table on CMPs

Design Exploration of an Instruction-Based Shared Markov Table on CMPs December 18, 23. Final report for CS 838. Design Exploration of an Instruction-Based Shared Markov Table on CMPs Lixin Su & Karthik Ramachandran Department of Electrical and Computer Engineering University

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches

Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches Nikos Hardavellas Michael Ferdman, Babak Falsafi, Anastasia Ailamaki Carnegie Mellon and EPFL Data Placement in Distributed

More information

Investigating CMP Synchronization Mechanisms

Investigating CMP Synchronization Mechanisms Investigating CMP Synchronization Mechanisms Koushik Chakraborty kchak@cs.wisc.edu Anu Vaidyanathan vaidyana@cs.wisc.edu CS 838 Dec 19, 2003 Philip Wells pwells@cs.wisc.edu 1 Introduction Synchronization

More information

ECE 588/688 Advanced Computer Architecture II

ECE 588/688 Advanced Computer Architecture II ECE 588/688 Advanced Computer Architecture II Instructor: Alaa Alameldeen alaa@ece.pdx.edu Fall 2009 Portland State University Copyright by Alaa Alameldeen and Haitham Akkary 2009 1 When and Where? When:

More information

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache

More information

Lecture 11: Large Cache Design

Lecture 11: Large Cache Design Lecture 11: Large Cache Design Topics: large cache basics and An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches, Kim et al., ASPLOS 02 Distance Associativity for High-Performance

More information

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood University of Wisconsin-Madison 1 Please find the power point presentation

More information

Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration

Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration Reproducible Simulation of Multi-Threaded Workloads for Architecture Design Exploration Cristiano Pereira, Harish Patil, Brad Calder $ Computer Science and Engineering, University of California, San Diego

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding effects of underlying architecture

Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding effects of underlying architecture Chapter 2 Note: The slides being presented represent a mix. Some are created by Mark Franklin, Washington University in St. Louis, Dept. of CSE. Many are taken from the Patterson & Hennessy book, Computer

More information

Page Replacement Algorithms

Page Replacement Algorithms Page Replacement Algorithms MIN, OPT (optimal) RANDOM evict random page FIFO (first-in, first-out) give every page equal residency LRU (least-recently used) MRU (most-recently used) 1 9.1 Silberschatz,

More information

Profiling Grid Data Transfer Protocols and Servers. George Kola, Tevfik Kosar and Miron Livny University of Wisconsin-Madison USA

Profiling Grid Data Transfer Protocols and Servers. George Kola, Tevfik Kosar and Miron Livny University of Wisconsin-Madison USA Profiling Grid Data Transfer Protocols and Servers George Kola, Tevfik Kosar and Miron Livny University of Wisconsin-Madison USA Motivation Scientific experiments are generating large amounts of data Education

More information

Multicast Snooping: A Multicast Address Network. A New Coherence Method Using. With sponsorship and/or participation from. Mark Hill & David Wood

Multicast Snooping: A Multicast Address Network. A New Coherence Method Using. With sponsorship and/or participation from. Mark Hill & David Wood Multicast Snooping: A New Coherence Method Using A Multicast Address Ender Bilir, Ross Dickson, Ying Hu, Manoj Plakal, Daniel Sorin, Mark Hill & David Wood Computer Sciences Department University of Wisconsin

More information

Optimizing Replication, Communication, and Capacity Allocation in CMPs

Optimizing Replication, Communication, and Capacity Allocation in CMPs Optimizing Replication, Communication, and Capacity Allocation in CMPs Zeshan Chishti, Michael D Powell, and T. N. Vijaykumar School of ECE Purdue University Motivation CMP becoming increasingly important

More information

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon] Performance CS 3410 Computer System Organization & Programming [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon] Performance Complex question How fast is the processor? How fast your application runs?

More information

Descriptive Statistics, Standard Deviation and Standard Error

Descriptive Statistics, Standard Deviation and Standard Error AP Biology Calculations: Descriptive Statistics, Standard Deviation and Standard Error SBI4UP The Scientific Method & Experimental Design Scientific method is used to explore observations and answer questions.

More information

The Role of Performance

The Role of Performance Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture The Role of Performance What is performance? A set of metrics that allow us to compare two different hardware

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,

More information

Closing the Performance Gap Between Volatile and Persistent K-V Stores

Closing the Performance Gap Between Volatile and Persistent K-V Stores Closing the Performance Gap Between Volatile and Persistent K-V Stores Yihe Huang, Harvard University Matej Pavlovic, EPFL Virendra Marathe, Oracle Labs Margo Seltzer, Oracle Labs Tim Harris, Oracle Labs

More information

StressRight: Finding the Right Stress for Accurate In-development System Evaluation

StressRight: Finding the Right Stress for Accurate In-development System Evaluation StressRight: Finding the Right Stress for Accurate In-development System Evaluation Jaewon Lee 1, Hanhwi Jang 1, Jae-eon Jo 1, Gyu-Hyeon Lee 2, Jangwoo Kim 2 High Performance Computing Lab Pohang University

More information

Simulating Server Consolidation

Simulating Server Consolidation 421 A Coruña, 16-18 de septiembre de 2009 Simulating Server Consolidation Antonio García-Guirado, Ricardo Fernández-Pascual, José M. García 1 Abstract Recently, virtualization has become a hot topic in

More information

A Serializability Violation Detector for Shared-Memory Server Programs

A Serializability Violation Detector for Shared-Memory Server Programs A Serializability Violation Detector for Shared-Memory Server Programs Min Xu Rastislav Bodík Mark Hill University of Wisconsin Madison University of California, Berkeley Serializability Violation Detector:

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

Chip-Multithreading Systems Need A New Operating Systems Scheduler

Chip-Multithreading Systems Need A New Operating Systems Scheduler Chip-Multithreading Systems Need A New Operating Systems Scheduler Alexandra Fedorova Christopher Small Daniel Nussbaum Margo Seltzer Harvard University, Sun Microsystems Sun Microsystems Sun Microsystems

More information

Performance, Power, Die Yield. CS301 Prof Szajda

Performance, Power, Die Yield. CS301 Prof Szajda Performance, Power, Die Yield CS301 Prof Szajda Administrative HW #1 assigned w Due Wednesday, 9/3 at 5:00 pm Performance Metrics (How do we compare two machines?) What to Measure? Which airplane has the

More information

Computer Performance. Reread Chapter Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm

Computer Performance. Reread Chapter Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm Computer Performance He said, to speed things up we need to squeeze the clock Reread Chapter 1.4-1.9 Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm L15 Computer Performance 1 Why Study Performance?

More information

ProtoFlex: FPGA-Accelerated Hybrid Simulator

ProtoFlex: FPGA-Accelerated Hybrid Simulator ProtoFlex: FPGA-Accelerated Hybrid Simulator Eric S. Chung, Eriko Nurvitadhi James C. Hoe, Babak Falsafi, Ken Mai Computer Architecture Lab at Multiprocessor Simulation Simulating one processor in software

More information

Thesis Contributions (Cont.) Question: Does compression help CMP performance? Contribution #3: Evaluate CMP cache and link compression

Thesis Contributions (Cont.) Question: Does compression help CMP performance? Contribution #3: Evaluate CMP cache and link compression Using to Improve Chip Multiprocessor Alaa R. Alameldeen Dissertation Defense Wisconsin Multifacet Project University of Wisconsin-Madison http://www.cs.wisc.edu/multifacet Thesis Contributions (Cont.)

More information

A Customized MVA Model for ILP Multiprocessors

A Customized MVA Model for ILP Multiprocessors A Customized MVA Model for ILP Multiprocessors Daniel J. Sorin, Mary K. Vernon, Vijay S. Pai, Sarita V. Adve, and David A. Wood Computer Sciences Dept University of Wisconsin - Madison sorin, vernon, david

More information

ECE 588/688 Advanced Computer Architecture II

ECE 588/688 Advanced Computer Architecture II ECE 588/688 Advanced Computer Architecture II Instructor: Alaa Alameldeen alaa@ece.pdx.edu Winter 2018 Portland State University Copyright by Alaa Alameldeen and Haitham Akkary 2018 1 When and Where? When:

More information

From Correlation to Causation: Active Delay Injection for Service Dependency Detection

From Correlation to Causation: Active Delay Injection for Service Dependency Detection From Correlation to Causation: Active Delay Injection for Service Dependency Detection Christopher Kruegel Computer Security Group ARO MURI Meeting ICSI, Berkeley, November 15, 2012 Correlation Engine

More information

TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM

TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Norwegian University of Science and Technology Department of Computer and Information Science Page 1 of 13 Contact: Magnus Jahre (952 22 309) TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Monday 4. June Time:

More information

Synthetic Traffic Generation: a Tool for Dynamic Interconnect Evaluation

Synthetic Traffic Generation: a Tool for Dynamic Interconnect Evaluation Synthetic Traffic Generation: a Tool for Dynamic Interconnect Evaluation W. Heirman, J. Dambre, J. Van Campenhout ELIS Department, Ghent University, Belgium Sponsored by IAP-V PHOTON & IAP-VI photonics@be,

More information

Bandwidth Adaptive Snooping

Bandwidth Adaptive Snooping University of Pennsylvania ScholarlyCommons Departmental Papers (CIS) Department of Computer & Information Science February 2002 Bandwidth Adaptive Milo Martin University of Pennsylvania, milom@cis.upenn.edu

More information

Dynamic Verification of Sequential Consistency

Dynamic Verification of Sequential Consistency Appears in the 32nd Annual International Symposium on Computer Architecture (ISCA) Madison, Wisconsin, June, 2005 Dynamic Verification of Sequential Consistency Albert Meixner 1 and Daniel J. Sorin 2 1

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

Quantifying Instruction Criticality for Shared Memory Multiprocessors

Quantifying Instruction Criticality for Shared Memory Multiprocessors Appears in the Proceedings of the 15th Symposium on Parallelism in Algorithms and Architectures San Diego, CA, June 7-9, 23 Quantifying Instruction Criticality for Shared Memory Multiprocessors Tong Li

More information

Fuzzy Flow Regulation for Network-on-Chip based Chip Multiprocessor System

Fuzzy Flow Regulation for Network-on-Chip based Chip Multiprocessor System Fuzzy Flow egulation for Network-on-Chip based Chip Multiprocessor System Yuan Yao and Zhonghai Lu KTH oyal Institute of Technology, Stockholm 14 th AS-DAC Conference 19-23, January, 2014, Singapore Outline

More information

Exploiting Core Criticality for Enhanced GPU Performance

Exploiting Core Criticality for Enhanced GPU Performance Exploiting Core Criticality for Enhanced GPU Performance Adwait Jog, Onur Kayıran, Ashutosh Pattnaik, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, Chita R. Das. SIGMETRICS 16 Era of Throughput Architectures

More information

Understanding Reduced-Voltage Operation in Modern DRAM Devices

Understanding Reduced-Voltage Operation in Modern DRAM Devices Understanding Reduced-Voltage Operation in Modern DRAM Devices Experimental Characterization, Analysis, and Mechanisms Kevin Chang A. Giray Yaglikci, Saugata Ghose,Aditya Agrawal *, Niladrish Chatterjee

More information

Write only as much as necessary. Be brief!

Write only as much as necessary. Be brief! 1 CIS371 Computer Organization and Design Midterm Exam Prof. Martin Thursday, March 15th, 2012 This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached

More information

Frequent Value Compression in Packet-based NoC Architectures

Frequent Value Compression in Packet-based NoC Architectures Frequent Value Compression in Packet-based NoC Architectures Ping Zhou, BoZhao, YuDu, YiXu, Youtao Zhang, Jun Yang, Li Zhao ECE Department CS Department University of Pittsburgh University of Pittsburgh

More information

Final Lecture. A few minutes to wrap up and add some perspective

Final Lecture. A few minutes to wrap up and add some perspective Final Lecture A few minutes to wrap up and add some perspective 1 2 Instant replay The quarter was split into roughly three parts and a coda. The 1st part covered instruction set architectures the connection

More information

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Agenda Introduction Memory Hierarchy Design CPU Speed vs.

More information

CS 838 Chip Multiprocessor Prefetching

CS 838 Chip Multiprocessor Prefetching CS 838 Chip Multiprocessor Prefetching Kyle Nesbit and Nick Lindberg Department of Electrical and Computer Engineering University of Wisconsin Madison 1. Introduction Over the past two decades, advances

More information

Abhishek Pandey Aman Chadha Aditya Prakash

Abhishek Pandey Aman Chadha Aditya Prakash Abhishek Pandey Aman Chadha Aditya Prakash System: Building Blocks Motivation: Problem: Determining when to scale down the frequency at runtime is an intricate task. Proposed Solution: Use Machine learning

More information

COSC4201. Multiprocessors and Thread Level Parallelism. Prof. Mokhtar Aboelaze York University

COSC4201. Multiprocessors and Thread Level Parallelism. Prof. Mokhtar Aboelaze York University COSC4201 Multiprocessors and Thread Level Parallelism Prof. Mokhtar Aboelaze York University COSC 4201 1 Introduction Why multiprocessor The turning away from the conventional organization came in the

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

IC220 Slide Set #5B: Performance (Chapter 1: 1.6, )

IC220 Slide Set #5B: Performance (Chapter 1: 1.6, ) Performance IC220 Slide Set #5B: Performance (Chapter 1: 1.6, 1.9-1.11) Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational

More information

Case Study 1: Optimizing Cache Performance via Advanced Techniques

Case Study 1: Optimizing Cache Performance via Advanced Techniques 6 Solutions to Case Studies and Exercises Chapter 2 Solutions Case Study 1: Optimizing Cache Performance via Advanced Techniques 2.1 a. Each element is 8B. Since a 64B cacheline has 8 elements, and each

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

WHITE PAPER. Optimizing Virtual Platform Disk Performance

WHITE PAPER. Optimizing Virtual Platform Disk Performance WHITE PAPER Optimizing Virtual Platform Disk Performance Optimizing Virtual Platform Disk Performance 1 The intensified demand for IT network efficiency and lower operating costs has been driving the phenomenal

More information

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Executive Summary Different memory technologies have different

More information

Parallel Streaming Computation on Error-Prone Processors. Yavuz Yetim, Margaret Martonosi, Sharad Malik

[Figure: axes labeled Upsets/B, muons/mb, Average Number of Dopant Atoms] Hardware Errors on the Rise. Soft Errors Due to Cosmic

2. Futile Stall HTM HTM HTM. Transactional Memory: TM [1] TM. HTM exponential backoff. magic waiting HTM. futile stall. Hardware Transactional Memory:

HTM, LogTM: 72.2%, 28.4%. 1. Transactional Memory: TM [1]. Hardware Transactional Memory. Nagoya Institute of Technology, Nagoya, Aichi, 466-8555, Japan. tsumura@nitech.ac.jp

Datacenter application interference

CMPs (popular in datacenters) offer increased throughput and reduced power consumption. They also increase resource sharing between applications, which can result in

WALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems

Bahareh Pourshirazi*, Majed Valad Beigi, Zhichun Zhu*, and Gokhan Memik. *University of Illinois at Chicago; Northwestern University

Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems

Matthew D. Sinclair*, Johnathan Alsop^, Sarita V. Adve+. *University of Wisconsin-Madison, ^AMD Research, +University

Advanced Multimedia Architecture Prof. Cristina Silvano June 2011 Amir Hossein ASHOURI

764722 IBM energy approach policy: One Size Fits All Encompass Software/Firmware/Hardware Power7 predecessors features

Clouseau: Probabilistic Dynamic Verification of Multithreaded Memory Systems

Albert Meixner¹ and Daniel J. Sorin². ¹Department of Computer Science, Duke University. ²Department of Electrical and Computer

SOFT 437. Software Performance Analysis. Ch 7&8: Software Measurement and Instrumentation

Why do we need data? Data is required to calculate: software execution model, system execution model. We assumed that we have required data to calculate these

Official Agenda Review the most commonly used tools in CMP µarch research. Unofficial Agenda Convince you to use Simics

Simics and Friends: Modeling Tools for CMP Research. Zvika Guz, Isask'har (Zigi) Walter, The Technion, Israel Institute of Technology.

LogCA: A High-Level Performance Model for Hardware Accelerators

"Everything should be made as simple as possible, but not simpler" (Albert Einstein). Muhammad Shoaib Bin Altaf*, David A. Wood, University of Wisconsin-Madison

It's slow! SQL Saturday. Copyright Heraflux Technologies. Do not redistribute or copy as your own. 1. Database. Firewall Load Balancer.

[Diagram labels: App request, Web Server, Firewall, Load Balancer, App Server, Report Server, Desktop App, Database, FG1, FG2, Log, MDF, NDF, LDF, SQL Server Instance]

COSC4201 Multiprocessors

Prof. Mokhtar Aboelaze. Parts of these slides are taken from Notes by Prof. David Patterson (UCB). Multiprocessing: We are dedicating all of our future product development to multicore

Staged Memory Scheduling

Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu. Carnegie Mellon University, *AMD Research. June 12th, 2012. Executive Summary. Observation:

Relaxed Memory Consistency

Jinkyu Jeong (jinkyu@skku.edu), Computer Systems Laboratory, Sungkyunkwan University, http://csl.skku.edu. SSE3054: Multicore Systems, Spring 2017

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

Adaptive Prefetching Technique for Shared Virtual Memory

Sang-Kwon Lee, Hee-Chul Yun, Joonwon Lee, Seungryoul Maeng. Computer Architecture Laboratory, Korea Advanced Institute of Science and Technology, 373-1

COMPUTER NETWORK PERFORMANCE. Gaia Maselli Room: 319

Gaia Maselli, maselli@di.uniroma1.it. Overview of first class: Practical Info (schedule, exam, readings), Goal of this course, Contents

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS

DANIEL SANCHEZ, MIT CSAIL. IAP MEETING, MAY 21, 2013. Research Agenda. Lack of technology progress. Moore's Law still alive. Power

System Simulator for x86

MARSS: Micro Architecture & System Simulator for x86. CAPS Group @ SUNY Binghamton. Presenter: Avadh Patel. http://marss86.org. Present State of Academic Simulators. Majority of Academic Simulators: Are for non

Outline Marquette University

COEN-4710 Computer Hardware. Lecture 1: Computer Abstractions and Technology (Ch.1). Cristinel Ababei, Department of Electrical and Computer Engineering. Credits: Slides adapted primarily from presentations

ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests

Mingxing Tan¹,², Gai Liu¹, Ritchie Zhao¹, Steve Dai¹, Zhiru Zhang¹. ¹Computer Systems Laboratory, Electrical and Computer

Chair for Network Architectures and Services Prof. Carle Department of Computer Science Technische Universität München.

Network Analysis 2b) Deterministic Modelling beyond Formal Logic. A simple network

Ch. 7: Benchmarks and Performance Tests

Kenneth Mitchell, School of Computing & Engineering, University of Missouri-Kansas City, Kansas City, MO 64110. Kenneth Mitchell, CS & EE dept., SCE, UMKC. Introduction

Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems

Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna

GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPs

Authors: José L. Abellán, Juan Fernández and Manuel E. Acacio. Presenter: Guoliang Liu. Outline: Introduction, Motivation, Background

Germán Llort

gllort@bsc.es. >10k processes + long runs = large traces. Blind tracing is not an option. Profilers also start presenting issues. Can you even store the data? How patient are you? IPDPS - Atlanta,

Speculative Locks. Dept. of Computer Science

José F. Martínez and Josep Torrellas, Dept. of Computer Science, University of Illinois at Urbana-Champaign. Motivation. Lock granularity a trade-off: Fine grain greater concurrency

Memory Consistency and Multiprocessor Performance

Memory Consistency Model. Define memory correctness for parallel execution. Execution appears to be that of some correct execution of some theoretical parallel

SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery

Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood. Computer Sciences Department, University

Infrastructure Tuning

For SQL Server Performance. SQL PASS Performance Virtual Chapter, 2014.07.24. About David Klee: @kleegeek, davidklee.net, gplus.to/kleegeek, linked.com/a/davidaklee. Specialties / Focus Areas

FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations

Michael K. Papamichael, James C. Hoe, Onur Mutlu. papamix@cs.cmu.edu, jhoe@ece.cmu.edu, onur@cmu.edu

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Kevin Chang, Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu. Executive Summary: DRAM refresh interferes

HPCA 18. Reliability-aware Data Placement for Heterogeneous memory Architecture

Manish GuptaΨ, Vilas Sridharan*, David Roberts*, Andreas ProdromouΨ, Ashish VenkatΨ, Dean TullsenΨ, Rajesh GuptaΨ

Performance Modeling

EECS 489 Computer Networks, http://www.eecs.umich.edu/~zmao/eecs489. Z. Morley Mao, Tuesday Sept 14, 2004. Acknowledgement: Some slides taken from Kurose&Ross and Katz&Stoica. Administrivia

CS 152 Computer Architecture and Engineering

Lecture 18: Directory-Based Cache Protocols. John Wawrzynek, EECS, University of California at Berkeley. http://inst.eecs.berkeley.edu/~cs152. Administrivia. Recap: