LoGA: Low-Overhead GPU Accounting Using Events

Size: px
Start display at page:

Download "LoGA: Low-Overhead GPU Accounting Using Events"

Transcription

1 10th ACM International Systems and Storage Conference, (KIT) KIT The Research University in the Helmholtz Association

2 GPU Sharing GPUs increasingly popular in computing Not every application saturates a GPU time Move GPUs to the cloud Sharing increases cost-efficiency Problem: Need fairness Software scheduling is inefficient 2

3 NEON (University of Rochester, ASPLOS '14) Applies fair queuing to GPUs NEON 3

4 NEON (University of Rochester, ASPLOS '14) Applies fair queuing to GPUs NEON 4

5 NEON (University of Rochester, ASPLOS '14) Applies fair queuing to GPUs NEON 5

6 NEON (University of Rochester, ASPLOS '14) Applies fair queuing to GPUs NEON 6

7 NEON (University of Rochester, ASPLOS '14) Applies fair queuing to GPUs NEON Threshold 7

8 NEON (University of Rochester, ASPLOS '14) Applies fair queuing to GPUs NEON STOP 8

9 NEON: Accounting Problem NEON s accounting disables GPU access GPU idle! STOP Freerun Phase Sampling Phase High accounting overhead if application does not saturate the GPU 9

10 Idea GPUs have lots of status registers for the device driver Leak information about GPU s internal state Reading has no effect on GPU driver or running application Idea: Poll status registers Infer GPU-internal context switches 10

11 Design App STOP Submit Query Poll 11 Scheduler Accounting thread

12 Accounting Thread Accounting thread Poll Status reg 12

13 Accounting Thread Accounting thread Poll Status reg 13

14 Accounting Thread Accounting thread Poll Status reg Poll frequency must be faster than kernel length High CPU load Poll periodically 14

15 Scheduling Scheduling thread queries accounting thread Updates fair queuing counters Stops applications if necessary Different metric than NEON NEON: Average kernel length LoGA: Total GPU time consumed 15

16 Benchmark Scenario Accounting overhead Accuracy of accounting ( scheduling quality) Competing workload: Throttle Creates well-defined GPU load run 16 sleep

17 Accounting Overhead (10% load) 1,6 10 instances of throttle, 1% load each Normalized runtime 1,5 Normalized against execution time w/o accounting 1,4 1,3 Scheduling disabled 1,2 1,1 NEON LoGA 1 0,9 0,8 17 nw cl ef pa ilte th r fin d sr er ad _v 1 sr st ad re am _v2 cl us G eo ter m ea n pa rti nn lu m d er g m pu yo cy te m um cf d dw ga t2d us s he ian ar tw a ho ll ts ho po t ts po t3 hu D ffm hy a br n id so k m rt ea n la s va le MD uk oc yt e bf s b+ tre e ba ck pr op 0,7

18 Accounting Overhead (90% load) 1,6 10 instances of throttle, 9% load each Normalized runtime 1,5 Normalized against execution time w/o accounting 1,4 1,3 Scheduling disabled 1,2 1,1 NEON LoGA 1 0,9 0,8 18 nw cl ef pa ilte th r fin d sr er ad _v 1 sr st ad re am _v2 cl us G eo ter m ea n pa rti nn lu m d er g m pu yo cy te m um cf d dw ga t2d us s he ian ar tw a ho ll ts ho po t ts po t3 hu D ffm hy a br n id so k m rt ea n la s va le MD uk oc yt e bf s b+ tre e ba ck pr op 0,7

19 Fairness Goal: LoGA should not reduce fairness Problem: Which application runtime is fair? Measure total GPU time for each application Calculate optimal scheduling Next slide: Speedup over fair schedule 19

20 Fairness: Results (4x throttle, 20% load each) 5 Speedup over fair schedule 4,5 4 3,5 3 2,5 No scheduling NEON LoGA LoGA-CL 2 1,5 1 0,5 20 km ea ns so rt hy br id an hu ffm ho ts po t ho ts po t3 D he ar tw al l ia n ga us s dw t2 d cf d b+ tre e bf s ba ck pr op 0

21 Conclusion Sharing beneficial if applications do not saturate the GPU Scheduler interference reduces sharing LoGA accounts GPU usage without overhead Poll GPU status registers, detect context switches Time between context switches as input for fair queuing 21

22 Finding Registers Envytools project has documented some registers Unfortunately, far from complete Trial and error: Run workload with known behavior Dump register values See which registers correlate with workload behavior Two registers identified: ID of currently running GPU context Activity status of entire GPU 22

GPrioSwap: Towards a Swapping Policy for GPUs

GPrioSwap: Towards a Swapping Policy for GPUs Jens Kehne, Jonathan Metter, Martin Merkel, Marius Hillenbrand, Marc Rittinghaus, Frank Bellosa 0th ACM International Systems and Storage Conference, (KIT) KIT The Research University in the Helmholtz

More information

LoGA: Low-overhead GPU accounting using events

LoGA: Low-overhead GPU accounting using events : Low-overhead GPU accounting using events Jens Kehne Stanislav Spassov Marius Hillenbrand Marc Rittinghaus Frank Bellosa Karlsruhe Institute of Technology (KIT) Operating Systems Group os@itec.kit.edu

More information

ENDF/B-VII.1 versus ENDFB/-VII.0: What s Different?

ENDF/B-VII.1 versus ENDFB/-VII.0: What s Different? LLNL-TR-548633 ENDF/B-VII.1 versus ENDFB/-VII.0: What s Different? by Dermott E. Cullen Lawrence Livermore National Laboratory P.O. Box 808/L-198 Livermore, CA 94550 March 17, 2012 Approved for public

More information

Row 1 This is data This is data

Row 1 This is data This is data mpdf TABLES CSS Styles The CSS properties for tables and cells is increased over that in html2fpdf. It includes recognition of THEAD, TFOOT and TH. See below for other facilities such as autosizing, and

More information

Row 1 This is data This is data. This is data out of p This is bold data p This is bold data out of p This is normal data after br H3 in a table

Row 1 This is data This is data. This is data out of p This is bold data p This is bold data out of p This is normal data after br H3 in a table mpdf TABLES CSS Styles The CSS properties for tables and cells is increased over that in html2fpdf. It includes recognition of THEAD, TFOOT and TH. See below for other facilities such as autosizing, and

More information

Prices exclude Freight and GST Page 1 of 9. Range Group Image (not to scale) Part Number Description Retail

Prices exclude Freight and GST Page 1 of 9. Range Group Image (not to scale) Part Number Description Retail Pric exclude Freight and GST Page 1 of 9 AEMMSFA101 ECONOMIC MONITOR ARM BLACK - $119 AEMMSLA01 LCD MONITOR ARM GREY - CLAMP $249 AEMMSLA02 LCD MONITOR ARM GREY - CLAMP $249 r Acc ce ss so ori e om mp

More information

Arachne. Core Aware Thread Management Henry Qin Jacqueline Speiser John Ousterhout

Arachne. Core Aware Thread Management Henry Qin Jacqueline Speiser John Ousterhout Arachne Core Aware Thread Management Henry Qin Jacqueline Speiser John Ousterhout Granular Computing Platform Zaharia Winstein Levis Applications Kozyrakis Cluster Scheduling Ousterhout Low-Latency RPC

More information

SPAREPARTSCATALOG: CONNECTORS SPARE CONNECTORS KTM ART.-NR.: 3CM EN

SPAREPARTSCATALOG: CONNECTORS SPARE CONNECTORS KTM ART.-NR.: 3CM EN SPAREPARTSCATALOG: CONNECTORS ART.-NR.: 3CM3208201EN CONTENT SPARE CONNECTORS AA-AN SPARE CONNECTORS AO-BC SPARE CONNECTORS BD-BQ SPARE CONNECTORS BR-CD 3 4 5 6 SPARE CONNECTORS CE-CR SPARE CONNECTORS

More information

Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling

Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling Larisa Stoltzfus*, Murali Emani, Pei-Hung Lin, Chunhua Liao *University of Edinburgh (UK), Lawrence Livermore National Laboratory

More information

Improving Cloud Application Performance with Simulation-Guided CPU State Management

Improving Cloud Application Performance with Simulation-Guided CPU State Management Improving Cloud Application Performance with Simulation-Guided CPU State Management Mathias Gottschlag, Frank Bellosa April 23, 2017 KARLSRUHE INSTITUTE OF TECHNOLOGY (KIT) - OPERATING SYSTEMS GROUP KIT

More information

SPARE CONNECTORS KTM 2014

SPARE CONNECTORS KTM 2014 SPAREPARTSCATALOG: // ENGINE ART.-NR.: 3208201EN CONTENT CONNECTORS FOR WIRING HARNESS AA-AN CONNECTORS FOR WIRING HARNESS AO-BC CONNECTORS FOR WIRING HARNESS BD-BQ CONNECTORS FOR WIRING HARNESS BR-CD

More information

Predicting Program Phases and Defending against Side-Channel Attacks using Hardware Performance Counters

Predicting Program Phases and Defending against Side-Channel Attacks using Hardware Performance Counters Predicting Program Phases and Defending against Side-Channel Attacks using Hardware Performance Counters Junaid Nomani and Jakub Szefer Computer Architecture and Security Laboratory Yale University junaid.nomani@yale.edu

More information

Training manual: An introduction to North Time Pro 2019 ESS at the terminal

Training manual: An introduction to North Time Pro 2019 ESS at the terminal Training manual: An introduction to North Time Pro 2019 ESS at the terminal Document t2-0800 Revision 15.1 Copyright North Time Pro www.ntdltd.com +44 (0) 2892 604000 For more information about North

More information

Mace, J. et al., "2DFQ: Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services," Proc. of ACM SIGCOMM '16, 46(4): , Aug

Mace, J. et al., 2DFQ: Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services, Proc. of ACM SIGCOMM '16, 46(4): , Aug Mace, J. et al., "2DFQ: Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services," Proc. of ACM SIGCOMM '16, 46(4):144-159, Aug. 2016. Introduction Server has limited capacity Requests by clients are

More information

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems J.C. Sáez, A. Pousa, F. Castro, D. Chaver y M. Prieto Complutense University of Madrid, Universidad Nacional de la Plata-LIDI

More information

A method to vary the Host interface signaling speeds in a Storage Array driving towards Greener Storage.

A method to vary the Host interface signaling speeds in a Storage Array driving towards Greener Storage. A method to vary the Host interface signaling speeds in a Storage Array driving towards Greener Storage. Dr. M. K. Jibbe, Technical Director NetApp, Wichita, Ks USA Mahmoud.jibbe@netapp.com 316 636 8810

More information

Datacenter application interference

Datacenter application interference 1 Datacenter application interference CMPs (popular in datacenters) offer increased throughput and reduced power consumption They also increase resource sharing between applications, which can result in

More information

ICP-OES. By: Dr. Sarhan A. Salman

ICP-OES. By: Dr. Sarhan A. Salman ICP-OES By: Dr. Sarhan A. Salman AGENDA -- ICP-OES Overview - Module removal Optical components and gas control overview Module replacement --Continue module replacement Manual optical alignment --Precise

More information

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe

More information

Adaptive Power Profiling for Many-Core HPC Architectures

Adaptive Power Profiling for Many-Core HPC Architectures Adaptive Power Profiling for Many-Core HPC Architectures Jaimie Kelley, Christopher Stewart The Ohio State University Devesh Tiwari, Saurabh Gupta Oak Ridge National Laboratory State-of-the-Art Schedulers

More information

Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems

Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems Chris Gregg Jeff S. Brantley Kim Hazelwood Department of Computer Science, University of Virginia Abstract A typical consumer desktop

More information

Activity concentration for exempt material (Bq/g)

Activity concentration for exempt material (Bq/g) 173.436 Exempt material activity concentrations and exempt consignment activity limits for radionuclides. The Table of Exempt material activity concentrations and exempt consignment activity limits for

More information

UNLOCKING BANDWIDTH FOR GPUS IN CC-NUMA SYSTEMS

UNLOCKING BANDWIDTH FOR GPUS IN CC-NUMA SYSTEMS UNLOCKING BANDWIDTH FOR GPUS IN CC-NUMA SYSTEMS Neha Agarwal* David Nellans Mike O Connor Stephen W. Keckler Thomas F. Wenisch* NVIDIA University of Michigan* (Major part of this work was done when Neha

More information

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS ARKAPRAVA BASU, JOSEPH L. GREATHOUSE, GURU VENKATARAMANI, JÁN VESELÝ AMD RESEARCH, ADVANCED MICRO DEVICES, INC. MODERN SYSTEMS ARE POWERED BY HETEROGENEITY

More information

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications José éf. Martínez * and Josep Torrellas University of Illinois ASPLOS 2002 * Now at Cornell University Overview Allow

More information

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,

More information

A Distributed Hash Table for Shared Memory

A Distributed Hash Table for Shared Memory A Distributed Hash Table for Shared Memory Wytse Oortwijn Formal Methods and Tools, University of Twente August 31, 2015 Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash

More information

A Scalable Event Dispatching Library for Linux Network Servers

A Scalable Event Dispatching Library for Linux Network Servers A Scalable Event Dispatching Library for Linux Network Servers Hao-Ran Liu and Tien-Fu Chen Dept. of CSIE National Chung Cheng University Traditional server: Multiple Process (MP) server A dedicated process

More information

ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS

ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS Prabodha Srimal Rodrigo Registration No. : 138230V Degree of Master of Science Department of Computer Science & Engineering University

More information

Cider. Native Execution of ios Apps on Android. Alexander Van't Hof, Naser AlDuaij, Christoffer Dall, Nicolas Viennot, Jason Nieh. Columbia University

Cider. Native Execution of ios Apps on Android. Alexander Van't Hof, Naser AlDuaij, Christoffer Dall, Nicolas Viennot, Jason Nieh. Columbia University Cider Native Execution of Apps on Jeremy Andrus (jeremya@cs.columbia.edu) Alexander Van't Hof, Naser AlDuaij, Christoffer Dall, Nicolas Viennot, Jason Nieh Columbia University In the City of New York March

More information

Batch Jobs Performance Testing

Batch Jobs Performance Testing Batch Jobs Performance Testing October 20, 2012 Author Rajesh Kurapati Introduction Batch Job A batch job is a scheduled program that runs without user intervention. Corporations use batch jobs to automate

More information

ProdDiagNode - Version: 1. Production Diagnostics for Node Applications

ProdDiagNode - Version: 1. Production Diagnostics for Node Applications ProdDiagNode - Version: 1 Production Diagnostics for Node Applications Production Diagnostics for Node Applications ProdDiagNode - Version: 1 2 days Course Description: Node.js, the popular cross-platform

More information

CSharp. Microsoft. PRO-Design & Develop Enterprise Appl by Using MS.NET Frmwk

CSharp. Microsoft. PRO-Design & Develop Enterprise Appl by Using MS.NET Frmwk Microsoft 70-549-CSharp PRO-Design & Develop Enterprise Appl by Using MS.NET Frmwk Download Full Version : http://killexams.com/pass4sure/exam-detail/70-549-csharp QUESTION: 170 You are an enterprise application

More information

Critically Missing Pieces on Accelerators: A Performance Tools Perspective

Critically Missing Pieces on Accelerators: A Performance Tools Perspective Critically Missing Pieces on Accelerators: A Performance Tools Perspective, Karthik Murthy, Mike Fagan, and John Mellor-Crummey Rice University SC 2013 Denver, CO November 20, 2013 What Is Missing in GPUs?

More information

Performance and Scalability of Server Consolidation

Performance and Scalability of Server Consolidation Performance and Scalability of Server Consolidation August 2010 Andrew Theurer IBM Linux Technology Center Agenda How are we measuring server consolidation? SPECvirt_sc2010 How is KVM doing in an enterprise

More information

GaaS Workload Characterization under NUMA Architecture for Virtualized GPU

GaaS Workload Characterization under NUMA Architecture for Virtualized GPU GaaS Workload Characterization under NUMA Architecture for Virtualized GPU Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, Tao Li Presented by Huixiang Chen ISPASS 2017 April 24, 2017, Santa Rosa, California

More information

Heterogeneous-Race-Free Memory Models

Heterogeneous-Race-Free Memory Models Heterogeneous-Race-Free Memory Models Jyh-Jing (JJ) Hwang, Yiren (Max) Lu 02/28/2017 1 Outline 1. Background 2. HRF-direct 3. HRF-indirect 4. Experiments 2 Data Race Condition op1 op2 write read 3 Sequential

More information

On the Accuracy and Scalability of Intensive I/O Workload Replay

On the Accuracy and Scalability of Intensive I/O Workload Replay On the Accuracy and Scalability of Intensive I/O Workload Replay Alireza Haghdoost, Weiping He, Jerry Fredin *, David H.C. Du University of Minnesota, * NetApp Inc. Center for Research in Intelligent Storage

More information

KEYNOTE FOR

KEYNOTE FOR THE OS SCHEDULER: A PERFORMANCE-CRITICAL COMPONENT IN LINUX CLUSTER ENVIRONMENTS KEYNOTE FOR BPOE-9 @ASPLOS2018 THE NINTH WORKSHOP ON BIG DATA BENCHMARKS, PERFORMANCE, OPTIMIZATION AND EMERGING HARDWARE

More information

M a p s 2. 0 T h e c o o p era tive w ay t o ke e p m a p s u p t o d a t e. O l iver. Kü h n

M a p s 2. 0 T h e c o o p era tive w ay t o ke e p m a p s u p t o d a t e. O l iver. Kü h n M a p s 2. 0 T h e c o o p era tive w ay t o ke e p m a p s u p t o d a t e O l iver Kü h n s k obble r G m b H M a i 2 0 1 3 This is a stor y about how we all get unconscious ly involved in creating digital

More information

ibench: Quantifying Interference in Datacenter Applications

ibench: Quantifying Interference in Datacenter Applications ibench: Quantifying Interference in Datacenter Applications Christina Delimitrou and Christos Kozyrakis Stanford University IISWC September 23 th 2013 Executive Summary Problem: Increasing utilization

More information

DIDO: Dynamic Pipelines for In-Memory Key-Value Stores on Coupled CPU-GPU Architectures

DIDO: Dynamic Pipelines for In-Memory Key-Value Stores on Coupled CPU-GPU Architectures DIDO: Dynamic Pipelines for In-Memory Key-Value Stores on Coupled CPU-GPU rchitectures Kai Zhang, Jiayu Hu, Bingsheng He, Bei Hua School of Computing, National University of Singapore School of Computer

More information

Bolt: I Know What You Did Last Summer In the Cloud

Bolt: I Know What You Did Last Summer In the Cloud Bolt: I Know What You Did Last Summer In the Cloud Christina Delimitrou 1 and Christos Kozyrakis 2 1 Cornell University, 2 Stanford University ASPLOS April 12 th 2017 Executive Summary Problem: cloud resource

More information

Register and Thread Structure Optimization for GPUs Yun (Eric) Liang, Zheng Cui, Kyle Rupnow, Deming Chen

Register and Thread Structure Optimization for GPUs Yun (Eric) Liang, Zheng Cui, Kyle Rupnow, Deming Chen Register and Thread Structure Optimization for GPUs Yun (Eric) Liang, Zheng Cui, Kyle Rupnow, Deming Chen Peking University, China Advanced Digital Science Center, Singapore University of Illinois at Urbana

More information

SIMD Divergence Optimization through Intra-Warp Compaction. Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13

SIMD Divergence Optimization through Intra-Warp Compaction. Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13 SIMD Divergence Optimization through Intra-Warp Compaction Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13 Problem GPU: wide SIMD lanes 16 lanes per warp in this work SIMD

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,

More information

BigDataBench-MT: Multi-tenancy version of BigDataBench

BigDataBench-MT: Multi-tenancy version of BigDataBench BigDataBench-MT: Multi-tenancy version of BigDataBench Gang Lu Beijing Academy of Frontier Science and Technology BigDataBench Tutorial, ASPLOS 2016 Atlanta, GA, USA n Software perspective Multi-tenancy

More information

Lionbridge ondemand for Adobe Experience Manager

Lionbridge ondemand for Adobe Experience Manager Lionbridge ondemand for Adobe Experience Manager Version 1.1.0 Configuration Guide October 24, 2017 Copyright Copyright 2017 Lionbridge Technologies, Inc. All rights reserved. Published in the USA. March,

More information

Basics of Performance Engineering

Basics of Performance Engineering ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently

More information

BFQ, fairness and low latency in block I/O The Two Towers. Paolo Valente

BFQ, fairness and low latency in block I/O The Two Towers. Paolo Valente BFQ, fairness and low latency in block I/O The Two Towers Paolo Valente Contents Why BFQ? What happened since the last episode? What is still to come? The two Towers How is it going in general with latency

More information

RS-232C Interface Command Table

RS-232C Interface Command Table RS-232C Interface Command Table EN 73 Lower OrderN Higher OrderQ 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 Complete Error Cassette Out Not Target ACK NAK 1 2 3 VCR Selection DVD Selection Play Stop 4 Still 5 Clear

More information

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

EE282 Computer Architecture. Lecture 1: What is Computer Architecture? EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer

More information

Enterprise. Breadth-First Graph Traversal on GPUs. November 19th, 2015

Enterprise. Breadth-First Graph Traversal on GPUs. November 19th, 2015 Enterprise Breadth-First Graph Traversal on GPUs Hang Liu H. Howie Huang November 9th, 5 Graph is Ubiquitous Breadth-First Search (BFS) is Important Wide Range of Applications Single Source Shortest Path

More information

Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding

Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding Keunsoo Kim, Sangpil Lee, Myung Kuk Yoon, *Gunjae Koo, Won Woo Ro, *Murali Annavaram Yonsei University *University of Southern

More information

Instruction Sets: Characteristics and Functions Addressing Modes

Instruction Sets: Characteristics and Functions Addressing Modes Instruction Sets: Characteristics and Functions Addressing Modes Chapters 10 and 11, William Stallings Computer Organization and Architecture 7 th Edition What is an Instruction Set? The complete collection

More information

Multitasking and scheduling

Multitasking and scheduling Multitasking and scheduling Guillaume Salagnac Insa-Lyon IST Semester Fall 2017 2/39 Previously on IST-OPS: kernel vs userland pplication 1 pplication 2 VM1 VM2 OS Kernel rchitecture Hardware Each program

More information

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches Onur Mutlu onur@cmu.edu March 23, 2010 GSRC Modern Memory Systems (Multi-Core) 2 The Memory System The memory system

More information

STEPS Towards Cache-Resident Transaction Processing

STEPS Towards Cache-Resident Transaction Processing STEPS Towards Cache-Resident Transaction Processing Stavros Harizopoulos joint work with Anastassia Ailamaki VLDB 2004 Carnegie ellon CPI OLTP workloads on modern CPUs 6 4 2 L2-I stalls L2-D stalls L1-I

More information

FLIGHTS TO / FROM CANADA ARE DOMESTIC

FLIGHTS TO / FROM CANADA ARE DOMESTIC MINIMUM CONNECTING TIME Houston, USA FLIGHTS TO / FROM CANADA ARE DOMESTIC IAH (Intercontinental Airport) DOMESTIC TO DOMESTIC :45 AA TO DL :40 CO TO AC :30 AA, UA, US :20 CO :30 DL, WN :25 DOMESTIC TO

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

Adaptive Optimization for OpenCL Programs on Embedded Heterogeneous Systems

Adaptive Optimization for OpenCL Programs on Embedded Heterogeneous Systems Adaptive Optimization for OpenCL Programs on Embedded Heterogeneous Systems Ben Taylor, Vicent Sanz Marco, Zheng Wang School of Computing and Communications, Lancaster University, UK {b.d.taylor, v.sanzmarco,

More information

Vulkan: Scaling to Multiple Threads. Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics

Vulkan: Scaling to Multiple Threads. Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics Vulkan: Scaling to Multiple Threads Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics www.imgtec.com Introduction Who am I? Kevin Sun Working at Imagination Technologies Take responsibility

More information

Towards Fair and Efficient SMP Virtual Machine Scheduling

Towards Fair and Efficient SMP Virtual Machine Scheduling Towards Fair and Efficient SMP Virtual Machine Scheduling Jia Rao and Xiaobo Zhou University of Colorado, Colorado Springs http://cs.uccs.edu/~jrao/ Executive Summary Problem: unfairness and inefficiency

More information

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational

More information

PYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads

PYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads PYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads Ran Xu (Purdue), Subrata Mitra (Adobe Research), Jason Rahman (Facebook), Peter Bai (Purdue),

More information

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010 Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:

More information

Benchmark Study: A Performance Comparison Between RHEL 5 and RHEL 6 on System z

Benchmark Study: A Performance Comparison Between RHEL 5 and RHEL 6 on System z Benchmark Study: A Performance Comparison Between RHEL 5 and RHEL 6 on System z 1 Lab Environment Setup Hardware and z/vm Environment z10 EC, 4 IFLs used (non concurrent tests) Separate LPARs for RHEL

More information

QoS-Aware Admission Control in Heterogeneous Datacenters

QoS-Aware Admission Control in Heterogeneous Datacenters QoS-Aware Admission Control in Heterogeneous Datacenters Christina Delimitrou, Nick Bambos and Christos Kozyrakis Stanford University ICAC June 28 th 2013 Cloud DC Scheduling Workloads DC Scheduler S S

More information

Position Paper: OpenMP scheduling on ARM big.little architecture

Position Paper: OpenMP scheduling on ARM big.little architecture Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia

More information

Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs

Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs Yanhao Chen 1, Ari B. Hayes 1, Chi Zhang 2, Timothy Salmon 1, Eddy Z. Zhang 1 1. Rutgers University 2. University of Pittsburgh Sparse

More information

PROCESS SCHEDULING II. CS124 Operating Systems Fall , Lecture 13

PROCESS SCHEDULING II. CS124 Operating Systems Fall , Lecture 13 PROCESS SCHEDULING II CS124 Operating Systems Fall 2017-2018, Lecture 13 2 Real-Time Systems Increasingly common to have systems with real-time scheduling requirements Real-time systems are driven by specific

More information

Cl. Cl. ..., V, V, -I..., - QJ d.

Cl. Cl. ..., V, V, -I..., - QJ d. )> Cl. Cl...., m V, V, -I..., QJ :J V, - QJ d. 0 :J Main Points Address Translation Concept - How do we convert a virtual address to a physical address? Flexible Address Translation I R. lz G.z: f1..t:?z:.mo/2_'-(

More information

Workload Characterization and Optimization of TPC-H Queries on Apache Spark

Workload Characterization and Optimization of TPC-H Queries on Apache Spark Workload Characterization and Optimization of TPC-H Queries on Apache Spark Tatsuhiro Chiba and Tamiya Onodera IBM Research - Tokyo April. 17-19, 216 IEEE ISPASS 216 @ Uppsala, Sweden Overview IBM Research

More information

Amir Zipory Senior Solutions Architect, Redhat Israel, Greece & Cyprus

Amir Zipory Senior Solutions Architect, Redhat Israel, Greece & Cyprus Amir Zipory Senior Solutions Architect, Redhat Israel, Greece & Cyprus amirz@redhat.com TODAY'S IT CHALLENGES IT is under tremendous pressure from the organization to enable growth Need to accelerate,

More information

Multi-Level Feedback Queues

Multi-Level Feedback Queues CS 326: Operating Systems Multi-Level Feedback Queues Lecture 8 Today s Schedule Building an Ideal Scheduler Priority-Based Scheduling Multi-Level Queues Multi-Level Feedback Queues Scheduling Domains

More information

TMCH Report March February 2017

TMCH Report March February 2017 TMCH Report March 2013 - February 2017 Contents Contents 2 1 Trademark Clearinghouse global reporting 3 1.1 Number of jurisdictions for which a trademark record has been submitted for 3 2 Trademark Clearinghouse

More information

Proactive System Check Supported Product List as of 14 April 2017

Proactive System Check Supported Product List as of 14 April 2017 Proactive System Check Supported Product List as of 14 April 2017 Expansions - no PSC price as attached within base unit price Tape drives - no PSC price as attached within base unit price Racks - no PSC

More information

Strong Scaling for Molecular Dynamics Applications

Strong Scaling for Molecular Dynamics Applications Strong Scaling for Molecular Dynamics Applications Scaling and Molecular Dynamics Strong Scaling How does solution time vary with the number of processors for a fixed problem size Classical Molecular Dynamics

More information

Database Acceleration Solution Using FPGAs and Integrated Flash Storage

Database Acceleration Solution Using FPGAs and Integrated Flash Storage Database Acceleration Solution Using FPGAs and Integrated Flash Storage HK Verma, Xilinx Inc. August 2017 1 FPGA Analytics in Flash Storage System In-memory or Flash storage based DB reduce disk access

More information

SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH

SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH Faculty of Computer Science Institute of Systems Architecture, Operating Systems Group SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH LAYER CAKE Application Runtime OS Kernel ISA Physical RAM 2 COMMODITY

More information

CD _. _. 'p ~~M CD, CD~~~~V. C ~'* Co ~~~~~~~~~~~~- CD / X. pd.0 & CD. On 0 CDC _ C _- CD C P O ttic 2 _. OCt CD CD (IQ. q"3. 3 > n)1t.

CD _. _. 'p ~~M CD, CD~~~~V. C ~'* Co ~~~~~~~~~~~~- CD / X. pd.0 & CD. On 0 CDC _ C _- CD C P O ttic 2 _. OCt CD CD (IQ. q3. 3 > n)1t. n 5 L n q"3 +, / X g ( E 4 11 " ') $ n 4 ) w Z$ > _ X ~'* ) i 1 _ /3 L 2 _ L 4 : 5 n W 9 U~~~~~~ 5 T f V ~~~~~~~~~~~~ (Q ' ~~M 3 > n)1 % ~~~~V v,~~ _ + d V)m X LA) z~~11 4 _ N cc ', f 'd 4 5 L L " V +,

More information

Parallel Processing for Data Deduplication

Parallel Processing for Data Deduplication Parallel Processing for Data Deduplication Peter Sobe, Denny Pazak, Martin Stiehr Faculty of Computer Science and Mathematics Dresden University of Applied Sciences Dresden, Germany Corresponding Author

More information

Performance Baseline of Oracle Exadata X2-2 HR HC

Performance Baseline of Oracle Exadata X2-2 HR HC Performance Baseline of Oracle Exadata X2-2 HR HC Part I: CPU Performance Benchware Performance Suite Release 8.4 (Build 130630) June 2013 Contents 1 Introduction to CPU Performance Tests 2 CPU and Server

More information

CS 326: Operating Systems. CPU Scheduling. Lecture 6

CS 326: Operating Systems. CPU Scheduling. Lecture 6 CS 326: Operating Systems CPU Scheduling Lecture 6 Today s Schedule Agenda? Context Switches and Interrupts Basic Scheduling Algorithms Scheduling with I/O Symmetric multiprocessing 2/7/18 CS 326: Operating

More information

Proactive System Check Supported Product List as of March 27, 2018

Proactive System Check Supported Product List as of March 27, 2018 Proactive System Check Supported Product List as of March 27, 2018 Expansions - no PSC price as attached within base unit price Tape drives - no PSC price as attached within base unit price Racks - no

More information

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

May 1, Foundation for Research and Technology - Hellas (FORTH) Institute of Computer Science (ICS) A Sleep-based Communication Mechanism to

May 1, Foundation for Research and Technology - Hellas (FORTH) Institute of Computer Science (ICS) A Sleep-based Communication Mechanism to A Sleep-based Our Akram Foundation for Research and Technology - Hellas (FORTH) Institute of Computer Science (ICS) May 1, 2011 Our 1 2 Our 3 4 5 6 Our Efficiency in Back-end Processing Efficiency in back-end

More information

OPAL-RT FIU User Guide.

OPAL-RT FIU User Guide. OPAL-RT FIU User Guide www.opal-rt.com Published by Opal-RT Technologies, Inc. 1751 Richardson, suite 2525 Montreal, Quebec, Canada H3K 1G6 www.opal-rt.com 2013 Opal-RT Technologies, Inc. All rights reserved

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 7

ECE 571 Advanced Microprocessor-Based Design Lecture 7 ECE 571 Advanced Microprocessor-Based Design Lecture 7 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 9 February 2017 Announcements HW#4 will be posted, some readings 1 Measuring

More information

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng Department of Micro-/Nano-electronics Tsinghua University Outline

More information

Evaluating the Error Resilience of Parallel Programs

Evaluating the Error Resilience of Parallel Programs Evaluating the Error Resilience of Parallel Programs Bo Fang, Karthik Pattabiraman, Matei Ripeanu, The University of British Columbia Sudhanva Gurumurthi AMD Research 1 Reliability trends The soft error

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Scorciatoie da tastiera - Circad 4.09

Scorciatoie da tastiera - Circad 4.09 Scorciatoie da tastiera - Circad 4.09 File FE New File - File FO Open File - File FR Re-Open - File FT Total Recall - File FD Load - File FC Close File - File FE Erase File - File FS Save File - File FA

More information

Huge market -- essentially all high performance databases work this way

Huge market -- essentially all high performance databases work this way 11/5/2017 Lecture 16 -- Parallel & Distributed Databases Parallel/distributed databases: goal provide exactly the same API (SQL) and abstractions (relational tables), but partition data across a bunch

More information

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based

More information

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency MediaTek CorePilot 2.0 Heterogeneous Computing Technology Delivering extreme compute performance with maximum power efficiency In July 2013, MediaTek delivered the industry s first mobile system on a chip

More information

GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs

GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs Authors: Jos e L. Abell an, Juan Fern andez and Manuel E. Acacio Presenter: Guoliang Liu Outline Introduction Motivation Background

More information

Monitoring Simplified.

Monitoring Simplified. Monitoring Simplified http://nsclient.org How many use NSClient++ NS-what did he say??#@*&%! wrong room! How many like NSClient++?..pdh collection thread not running ERROR: Missing argument exception PdhCollectQueryData?

More information