Accurate emulation of CPU performance


Accurate emulation of CPU performance
Tomasz Buchert (1), Lucas Nussbaum (2), Jens Gustedt (1)
(1) INRIA Nancy Grand Est, (2) LORIA / Nancy-Université

Validation of distributed systems

Approaches:
- Theoretical approach (paper and pencil): the most general results and understanding, but very hard (leads to unsolvability results).
- Experimentation (real application on a real environment): realistic context and credibility, but difficult to prepare and control, with questionable reproducibility.
- Simulation (modeled application inside a modeled environment): very simple and perfectly reproducible, but subject to experimental bias and possibly unrealistic.
- Emulation (real application inside a modeled environment): control over the experiment parameters, but difficult.

Emulation

The perfect emulated environment should emulate (independently):
- network bandwidth, latency and topology
- performance and number of CPUs
- memory capabilities
- background noise (network, CPU, faults)

This is already implemented in Wrekavoc, a tool to define and control the heterogeneity of a cluster (but not perfect yet!). In this talk, however, we specifically concentrate on the emulation of the CPU.

CPU emulation

Various elements of the CPU architecture could be emulated:
- speed
- number of cores
- sizes and properties of caches (and their topology)
- memory access speed (especially for NUMA systems)

In this talk, we focus on the degradation of CPU speed.

An example

[Diagram: four CPUs/cores throttled independently to 50 %, 50 %, 70 % and 30 % of their speed; the remaining capacity of each CPU is left unused.]

(1) Controlling the speed of each CPU/core independently.

An example (continued)

[Same diagram: four CPUs/cores throttled to 50 %, 50 %, 70 % and 30 %.]

(2) Being able to create separate scheduling zones.

Dynamic frequency scaling (CPU-Freq)

Also known as Intel Enhanced SpeedStep or AMD Cool'n'Quiet: a hardware feature designed to reduce heat, noise and power usage.

For: no emulation overhead, completely unintrusive, CPU time measurements remain meaningful.
Against: only a finite set of frequency levels is available.
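
As an illustration only (not from the original slides), the sketch below shows how CPU-Freq-style throttling is commonly driven on Linux through the cpufreq sysfs interface, assuming a driver that exposes the "userspace" governor; the paths and the chosen frequency are assumptions.

```python
# Minimal sketch of CPU-Freq-style frequency scaling via the Linux cpufreq
# sysfs interface. Hypothetical example, not code from the presentation;
# requires root and a CPU/driver that exposes the "userspace" governor.
from pathlib import Path

def set_cpu_frequency(cpu: int, freq_khz: int) -> None:
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq")
    available = [int(f) for f in
                 (base / "scaling_available_frequencies").read_text().split()]
    if freq_khz not in available:
        # only a finite set of frequency levels is supported by the hardware
        raise ValueError(f"{freq_khz} kHz not in {available}")
    (base / "scaling_governor").write_text("userspace")
    (base / "scaling_setspeed").write_text(str(freq_khz))

if __name__ == "__main__":
    set_cpu_frequency(cpu=0, freq_khz=1_600_000)  # pin core 0 to 1.6 GHz, if supported
```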

CPU-Lim

The method available in Wrekavoc. Algorithm:
- if CPU usage >= threshold, send SIGSTOP to the process
- if CPU usage < threshold, send SIGCONT to the process
where CPU usage = (CPU time of the process) / (process lifetime).

For: easy and almost POSIX-compliant.
Against: intrusive and not scalable, the decision is based on one process instead of global CPU usage, and sleeping is indistinguishable from preemption.
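
To make the algorithm concrete, here is a hypothetical reimplementation of a CPU-Lim-style limiter (a sketch, not the Wrekavoc code): it derives CPU usage from /proc/<pid>/stat as the ratio defined above and toggles SIGSTOP/SIGCONT around the threshold.

```python
# Hypothetical sketch of a CPU-Lim-style limiter (not the Wrekavoc code).
# CPU usage = CPU time of the process / process lifetime; the process is
# stopped when usage reaches the threshold and resumed when it drops below.
import os, signal, time

CLK_TCK = os.sysconf("SC_CLK_TCK")

def cpu_usage(pid: int) -> float:
    fields = open(f"/proc/{pid}/stat").read().rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])      # CPU time, in clock ticks
    start_time = int(fields[19]) / CLK_TCK               # process start, seconds since boot
    uptime = float(open("/proc/uptime").read().split()[0])
    lifetime = max(uptime - start_time, 1e-6)
    return ((utime + stime) / CLK_TCK) / lifetime

def limit(pid: int, threshold: float, period: float = 0.1) -> None:
    while True:
        if cpu_usage(pid) >= threshold:
            os.kill(pid, signal.SIGSTOP)                  # preempt the process
        else:
            os.kill(pid, signal.SIGCONT)                  # let it run again
        time.sleep(period)
```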

Fracas

Based on an idea from KRASH (a load injection tool). Uses Linux cgroups and the Completely Fair Scheduler: a predefined portion of each CPU is given to tasks that burn CPU, and all other processes are given the remaining CPU time.

[Diagram: one CPU burner per core (Core 1, Core 2, Core 3), with the emulated processes sharing the remaining time of each core.]

Fracas (continued)

For: unintrusive, scalable.
Against: not portable to other systems, sensitive to the configuration of the scheduler.
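
The core mechanism can be sketched with cgroups: the example below is illustrative only, assumes the cgroup v1 cpu controller is mounted at /sys/fs/cgroup/cpu, and is not the actual Fracas implementation. It creates a "burner" group that receives a fixed fraction of the CPU via cpu.shares and leaves the remainder to an "emulated" group.

```python
# Hypothetical sketch of the Fracas idea with cgroup v1 and CFS cpu.shares:
# a "burner" group receives a fixed fraction of the CPU, the "emulated"
# group gets the remainder. Requires root; not the actual Fracas code.
import os
from pathlib import Path

CG = Path("/sys/fs/cgroup/cpu")           # assumes the cgroup v1 cpu controller here
TOTAL_SHARES = 1024

def make_group(name: str, shares: int) -> Path:
    g = CG / name
    g.mkdir(exist_ok=True)
    (g / "cpu.shares").write_text(str(shares))
    return g

def burn() -> None:
    while True:                           # busy loop consuming CPU time
        pass

if __name__ == "__main__":
    burned_fraction = 0.7                 # emulate a CPU running at 30 % of its speed
    burner = make_group("burner", int(TOTAL_SHARES * burned_fraction))
    emulated = make_group("emulated", int(TOTAL_SHARES * (1 - burned_fraction)))

    pid = os.fork()
    if pid == 0:                          # child: join the burner group and burn CPU
        (burner / "tasks").write_text(str(os.getpid()))
        burn()
    else:                                 # parent: emulated processes go here
        (emulated / "tasks").write_text(str(os.getpid()))
```

With both groups runnable, CFS divides CPU time proportionally to cpu.shares, which is what gives the emulated processes their reduced share; the real tool also needs one burner per core and CPU pinning.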

Fracas and the latency of the scheduler

[Plot: GFLOP/s versus emulated CPU frequency (GHz) for scheduler latencies of 1 ms, 10 ms, 100 ms and 1000 ms.]

The smaller the latency, the better the emulation.

Evaluation

Based on different types of work: CPU-intensive (Linpack benchmark), IO-bound, multiprocessing, multithreading, and memory speed (STREAM benchmark).

X axis: emulated frequency. Y axis: speed perceived by the benchmark. Each test was repeated 10 times; results are averages with 95 % confidence intervals computed using Student's t distribution.

Evaluation was performed on the Grid'5000 platform, on nodes with two quad-core Intel Xeon X5570 processors and nodes with a pair of single-core AMD Opteron 252 processors.
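
For reference, the mean and 95 % confidence interval over 10 repetitions can be computed as in this short sketch (the sample values are made up, not data from the evaluation).

```python
# Sketch: mean and 95 % confidence interval of repeated measurements using
# Student's t distribution. The sample values are illustrative only.
from statistics import mean, stdev
from scipy.stats import t

def confidence_interval(samples, confidence=0.95):
    n = len(samples)
    m, s = mean(samples), stdev(samples)                 # sample mean and std deviation
    half_width = t.ppf((1 + confidence) / 2, n - 1) * s / n ** 0.5
    return m, half_width

if __name__ == "__main__":
    gflops = [2.61, 2.58, 2.64, 2.59, 2.62, 2.60, 2.63, 2.57, 2.61, 2.60]
    m, h = confidence_interval(gflops)
    print(f"{m:.3f} ± {h:.3f} GFLOP/s (95 % CI, n={len(gflops)})")
```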

Grid'5000: 9 sites, 1600 machines (Lille, Rennes, Orsay, Nancy, Bordeaux, Lyon, Grenoble, Toulouse, Sophia). Dedicated to research on distributed systems and HPC.

CPU-intensive work

[Plot: GFLOP/s versus emulated CPU frequency (GHz) for CPU-Freq, CPU-Lim and Fracas.]

CPU-Lim is less predictable (the outcome has higher variance).

IO-bound work

[Plot: loops/s versus emulated CPU frequency (GHz) for CPU-Freq, CPU-Lim and Fracas.]

CPU-Lim gives an (unfair) advantage to IO-bound tasks.

Multiprocessing

[Plot: loops/s versus emulated CPU frequency (GHz) for CPU-Freq, CPU-Lim and Fracas.]

Fracas cannot emulate CPU speed for multi-task computation.

Multithreading

[Plot: loops/s versus emulated CPU frequency (GHz) for CPU-Freq, CPU-Lim and Fracas.]

CPU-Lim controls processes instead of scheduling entities.

Memory speed

[Plot: GB/s versus emulated CPU frequency (GHz) for CPU-Freq, CPU-Lim and Fracas.]

Memory speed is affected differently by each method.

Summary of the evaluation

CPU-Freq: very good results, but coarse granularity.
CPU-Lim: not scalable due to its implementation, intrusive, higher variance, and controls processes rather than threads.
Fracas: good behavior for single-task workloads and scalable, but bad behavior for multi-task workloads.

Future work

- Explore other approaches.
- Improve Fracas to cover multitasking.
- Emulate memory bandwidth.
- Emulate other aspects of the CPU.
- Integrate Fracas into Wrekavoc.
- Take over the world :)

Conclusions

Presented Fracas, a method for CPU performance emulation based on Linux cgroups. Compared it with CPU-Freq and CPU-Lim (from Wrekavoc). Evaluated all three experimentally on Grid'5000. None of the methods is perfect:
- CPU-Freq: coarse-grained
- CPU-Lim: implementation problems, not scalable
- Fracas: works perfectly in the single-thread/process case, needs work in the multi-thread/process case

Questions?
