Continuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting

1 Continuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting
Animesh Jain, Michael A. Laurenzano, Lingjia Tang and Jason Mars
International Symposium on Microarchitecture (MICRO), 2016
October 18, 2016

7 Rampant Dynamism in Datacenters
Dynamism: dynamic factors that affect application runtime environments
- Microarchitectural flexibility
- Co-running of applications
- Platform diversity
Dynamism affects the runtime availability of resources

19 Static Compiler Optimizations
- Compilation assumptions might not be met at runtime
- Resource-dependent static optimizations do not react to dynamism
Loop Tiling
- Restructures the memory access pattern to exploit data reuse
- Conceptualized before the multicore era, which presented little dynamism
[Figure: static vs. ideal tiling under four scenarios: normal, co-running application, partitioned cache, different architecture]
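
As a minimal illustration of the loop tiling this slide refers to, the sketch below tiles a square matrix multiply in plain Python. The function names and the `tile` parameter are illustrative and do not come from the talk. A static compiler would fix `tile` at compile time, which is exactly the kind of assumption that may not hold at runtime.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A x B for square lists of lists."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile):
    """Same computation, visited block by block.

    `tile` is a hypothetical tile edge length. Each (ii, kk, jj) block
    touches only a tile x tile region of A, B and C, so the data can be
    reused while it is still resident in cache.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # min(...) handles matrices whose size is not a multiple of tile.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

The tiled version computes the same result for any positive `tile`; only the order of memory accesses changes, so the best value depends on how much cache the loop actually has available.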

24 Co-runner Tiling Comparison
[Figure: three panels, each comparing static vs. dynamic tiling]
Dynamism requires rethinking cache tiling

32 Design Objectives
- Dynamic: should react to changes in the runtime environment
- High accuracy: should identify a high-performance tiling strategy
- Low overhead: should have low dynamic performance overhead
Current techniques are not enough
- White-box approaches
- Math kernel libraries like Intel MKL, ATLAS
[Table: white-box approach and BLAS libraries rated against the three objectives (dynamic, accuracy, low overhead)]
Online generation of a black-box model

33 Shape Shifter

40 Key Components
- Dynamic tile generation: companion thread (Protean Code + Polly)
- Detect tiling opportunities: Runtime Environment Monitor (REM)
- Find a high-performance tile: tile selector
[Diagram labels: tiled loop, code cache, dynamic compiler, companion controller, REM, tile selector, Application 1 / Companion 1, Application 2 / Companion 2, ShapeShifter]
Protean Code, MICRO 2014 and Polly, PLDI 2008

46 Overview
- Online training: select tile size and generate training data
- Tile selection: generate black-box model and select suitable tile shape
- Monitored execution: detect tiling opportunities
[Diagram labels: dynamic compiler, tile selector, REM; online training (find tile size, collect cache stats, training set); tile selection (tile performance model, choose tile); monitored execution; runtime environment change]
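
The three phases on this slide can be read as a small state machine. The sketch below is one possible reading, not the authors' implementation: the `Phase` enum, `next_phase` and `run` are hypothetical names, and the rule that a runtime environment change sends monitored execution back to online training is an assumption drawn from the "Runtime environment change" label in the diagram.

```python
from enum import Enum


class Phase(Enum):
    ONLINE_TRAINING = "online training"
    TILE_SELECTION = "tile selection"
    MONITORED_EXECUTION = "monitored execution"


def next_phase(phase, environment_changed):
    """Advance the hypothetical three-phase loop by one step."""
    if phase is Phase.ONLINE_TRAINING:
        # Training data has been collected; build the model and pick a tile.
        return Phase.TILE_SELECTION
    if phase is Phase.TILE_SELECTION:
        # A tile has been chosen; run with it and watch the environment.
        return Phase.MONITORED_EXECUTION
    # Monitored execution: assumed to retrain only when the environment changes.
    return Phase.ONLINE_TRAINING if environment_changed else phase


def run(events):
    """Replay a list of booleans (True = environment change observed)."""
    phase = Phase.ONLINE_TRAINING
    trace = [phase]
    for changed in events:
        phase = next_phase(phase, changed)
        trace.append(phase)
    return trace
```

Calling `run([False, False, False, True])` walks from training through selection into monitored execution, stays there while nothing changes, and returns to training on the final event.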

52 Tile Selection
- Black-box model is generated online
- Uses tile parameters and IPC from tile data
- Model is specific to the application and its current runtime environment
- Predicts a tile suitable to the current runtime environment
[Figure: training data feeds the black-box model; plot of IPC against tile parameters marking IPC max, IPC pred, the set of tile shapes of the predicted size, and T_shapeshifter]
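
The slide does not say what form the black-box model takes. As a sketch under that caveat, the code below uses a k-nearest-neighbour average as a stand-in model: it maps tile parameters to a predicted IPC from measured samples, then picks the candidate tile with the highest prediction. The names `predict_ipc` and `choose_tile`, the tuple encoding of tiles, and the sample numbers in the usage note are all hypothetical.

```python
def predict_ipc(training, tile, k=2):
    """Predict IPC for `tile` as the mean IPC of its k nearest training tiles.

    training : list of (tile_parameters, measured_ipc) pairs
    tile     : tuple of tile parameters, same length as the training tiles
    """
    def squared_distance(sample_tile):
        return sum((x - y) ** 2 for x, y in zip(sample_tile, tile))

    nearest = sorted(training, key=lambda sample: squared_distance(sample[0]))[:k]
    return sum(ipc for _, ipc in nearest) / len(nearest)


def choose_tile(training, candidates, k=2):
    """Return the candidate tile with the highest predicted IPC."""
    return max(candidates, key=lambda tile: predict_ipc(training, tile, k))
```

For example, with made-up samples `[((8, 8), 1.0), ((16, 16), 1.6), ((32, 32), 2.0), ((64, 64), 1.2)]` and candidates `[(8, 16), (28, 28), (64, 48)]`, `choose_tile` returns `(28, 28)`, the candidate closest to the best-performing samples. Because the model is rebuilt from fresh samples, it reflects only the current application and runtime environment.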

56 Insight for Co-optimization
- Challenging to retile multiple applications simultaneously
- Tile shape and tile size contribute differently to cache interference
Co-optimization: find the tile size for the apps, then the tile shape one-by-one
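
The two-stage idea on this slide (sizes first, then shapes one application at a time) can be sketched as a greedy search. This is an illustration only: `co_optimize`, the `score` callback, and the way sizes and shapes are encoded are hypothetical, and the slide does not specify how the per-stage search is actually carried out.

```python
def co_optimize(apps, sizes, shapes, score):
    """Two-stage greedy search over tile configurations.

    apps   : list of application names
    sizes  : candidate tile sizes
    shapes : candidate tile shapes
    score  : hypothetical callback mapping {app: (size, shape)} to an
             aggregate performance number (higher is better)
    """
    config = {app: (sizes[0], shapes[0]) for app in apps}

    # Stage 1: settle tile sizes for every app, keeping the default shape.
    for app in apps:
        shape = config[app][1]
        config[app] = max(
            ((size, shape) for size in sizes),
            key=lambda choice: score({**config, app: choice}),
        )

    # Stage 2: refine tile shapes one application at a time.
    for app in apps:
        size = config[app][0]
        config[app] = max(
            ((size, shape) for shape in shapes),
            key=lambda choice: score({**config, app: choice}),
        )
    return config
```

Splitting the search this way evaluates roughly `len(apps) * (len(sizes) + len(shapes))` configurations instead of every size and shape combination for every application at once.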

57 Experimental Evaluation

61 Methodology
- PolyBench application suite
- Three sources of dynamism: co-running applications, microarchitectural flexibility (cache partitioning), platform diversity
- Three platforms: AMD Bulldozer, Intel Haswell, Intel Atom
- Tiling is performed in the shared cache

66 Co-runner
- Arrival/departure of a co-runner
- Static Best: best tile with no co-runner
- Co-runner change: syr2k to correlation
- Change in cache allocations

67 Microarchitectural Flexibility
- Microarchitectural flexibility: cache partitioning
- Static Best: best tile with no cache resizing (16-way enabled)

72 Platform Diversity
- Platform diversity: Intel Atom, Intel Haswell and AMD Bulldozer
- Static Best: best tile on AMD Bulldozer

79 Conclusions
ShapeShifter: end-to-end dynamic loop co-optimization
- Adapt tiling strategy to the application runtime environment
- Loop co-optimization: tiling multiple applications on the fly
- Novel black-box modelling approach: fast and accurate
ShapeShifter achieves significant performance improvements across different sources of dynamism

80 Q/A

81 Why does the black-box model work?
- There is a trade-off between the best tiling strategy and performance
- We show that SS chooses a close one
Why 3D tiling?
- Built on Polly, but the technique is not restricted to 3D tiling
Also memorize the compilation times
2 reasons for slowdown: tile doesn't matter, black-box model not good enough
Remember cache sizes
Prior work refresh

82 Overhead
- Companion thread
- Three sources of overhead:
  - Dynamic compilation: 136 ms on Intel Haswell, 430 ms on AMD Bulldozer
  - Code redirection
  - Training

83 Overhead: training

84 Black-box model
- Multiple high-performance tiles
- ShapeShifter chooses one of the high-performance tiles

85 ShapeShifter vs Dynamic Oracle
- ShapeShifter achieves 93% of the dynamic oracle performance on average

86 Co-runner

Continuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting. Animesh Jain, Michael A. Laurenzano, Lingjia Tang and Jason Mars, University of Michigan, Ann Arbor.

More information

SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers

SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers Yunqi Zhang, Michael A. Laurenzano, Jason Mars, Lingjia Tang Clarity-Lab Electrical Engineering

More information

Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems

Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems 1 Presented by Hadeel Alabandi Introduction and Motivation 2 A serious issue to the effective utilization

More information

Datacenter application interference

Datacenter application interference 1 Datacenter application interference CMPs (popular in datacenters) offer increased throughput and reduced power consumption They also increase resource sharing between applications, which can result in

More information

BLASFEO. Gianluca Frison. BLIS retreat September 19, University of Freiburg

BLASFEO. Gianluca Frison. BLIS retreat September 19, University of Freiburg University of Freiburg BLIS retreat September 19, 217 Basic Linear Algebra Subroutines For Embedded Optimization performance dgemm_nt 5 4 Intel Core i7 48MQ HP OpenBLAS.2.19 MKL 217.2.174 ATLAS 3.1.3 BLIS.1.6

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

Code transformations for energy efficiency; a decoupled accessexecute

Code transformations for energy efficiency; a decoupled accessexecute Code transformations for energy efficiency; a decoupled accessexecute approach Work performed at Uppsala University Konstantinos Koukos November 2016 OVERALL GOAL The big goal Better exploit the potential

More information

An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors

An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors Jiacheng Zhao Institute of Computing Technology, CAS In Conjunction with Prof. Jingling Xue, UNSW, Australia

More information

Virtual Melting Temperature: Managing Server Load to Minimize Cooling Overhead with Phase Change Materials

Virtual Melting Temperature: Managing Server Load to Minimize Cooling Overhead with Phase Change Materials Virtual Melting Temperature: Managing Server Load to Minimize Cooling Overhead with Phase Change Materials Matt Skach1, Manish Arora2,3, Dean Tullsen3, Lingjia Tang1, Jason Mars1 University of Michigan1

More information

Low-overhead Online Code Transformations

Low-overhead Online Code Transformations Low-overhead Online Code Transformations by Michael A. Laurenzano A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering)

More information

Addressing Memory Bottlenecks for Emerging Applications

Addressing Memory Bottlenecks for Emerging Applications Addressing Memory Bottlenecks for Emerging Applications by Animesh Jain A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and

More information

Mitigating Resource Contention in Warehouse Scale Computers

Mitigating Resource Contention in Warehouse Scale Computers 1 Mitigating Resource Contention in Warehouse Scale Computers A Dissertation Presented to the faculty of the School of Engineering and Applied Science University of Virginia In Partial Fulfillment of the

More information

*Yuta SAWA and Reiji SUDA The University of Tokyo

*Yuta SAWA and Reiji SUDA The University of Tokyo Auto Tuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS *Yuta SAWA and Reiji SUDA The University of Tokyo iwapt 29 October 1-2 *Now in Central Research Laboratory, Hitachi,

More information

Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers

Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers Johann Hauswald, Michael A. Laurenzano, Yunqi Zhang, Cheng Li, Austin Rovinski,

More information

Lixia Liu, Zhiyuan Li Purdue University, USA. grants ST-HEC , CPA and CPA , and by a Google Fellowship

Lixia Liu, Zhiyuan Li Purdue University, USA. grants ST-HEC , CPA and CPA , and by a Google Fellowship Lixia Liu, Zhiyuan Li Purdue University, USA PPOPP 2010, January 2009 Work supported in part by NSF through Work supported in part by NSF through grants ST-HEC-0444285, CPA-0702245 and CPA-0811587, and

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems J.C. Sáez, A. Pousa, F. Castro, D. Chaver y M. Prieto Complutense University of Madrid, Universidad Nacional de la Plata-LIDI

More information

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory

More information

Managing GPU Concurrency in Heterogeneous Architectures

Managing GPU Concurrency in Heterogeneous Architectures Managing Concurrency in Heterogeneous Architectures Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das Era of Heterogeneous Architectures

More information

Graphics Performance Analyzer for Android

Graphics Performance Analyzer for Android Graphics Performance Analyzer for Android 1 What you will learn from this slide deck Detailed optimization workflow of Graphics Performance Analyzer Android* System Analysis Only Please see subsequent

More information

Multi-core Programming Evolution

Multi-core Programming Evolution Multi-core Programming Evolution Based on slides from Intel Software ollege and Multi-ore Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts, Evolution

More information

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms. Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

More information

AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016

AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016 AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING BILL.BRANTLEY@AMD.COM, FELLOW 3 OCTOBER 2016 AMD S VISION FOR EXASCALE COMPUTING EMBRACING HETEROGENEITY CHAMPIONING OPEN SOLUTIONS ENABLING LEADERSHIP

More information

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas 1 Why? High-Performance Multicores for Real-Time Systems

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 0 J A N L E M E I R E Course Objectives 2 Intel 4004 1971 2.3K trans. Intel Core 2 Duo 2006 291M trans. Where have all the transistors gone? Turing Machine

More information

Defensive Loop Tiling for Shared Cache. Bin Bao Adobe Systems Chen Ding University of Rochester

Defensive Loop Tiling for Shared Cache. Bin Bao Adobe Systems Chen Ding University of Rochester Defensive Loop Tiling for Shared Cache Bin Bao Adobe Systems Chen Ding University of Rochester Bird and Program Unlike a bird, which can learn to fly better and better, existing programs are sort of dumb---the

More information

Iterative Compilation with Kernel Exploration

Iterative Compilation with Kernel Exploration Iterative Compilation with Kernel Exploration Denis Barthou 1 Sébastien Donadio 12 Alexandre Duchateau 1 William Jalby 1 Eric Courtois 3 1 Université de Versailles, France 2 Bull SA Company, France 3 CAPS

More information

Chapter 18 - Multicore Computers

Chapter 18 - Multicore Computers Chapter 18 - Multicore Computers Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ Luis Tarrataca Chapter 18 - Multicore Computers 1 / 28 Table of Contents I 1 2 Where to focus your study Luis Tarrataca

More information

THE FUTURE OF GPU DATA MANAGEMENT. Michael Wolfe, May 9, 2017

THE FUTURE OF GPU DATA MANAGEMENT. Michael Wolfe, May 9, 2017 THE FUTURE OF GPU DATA MANAGEMENT Michael Wolfe, May 9, 2017 CPU CACHE Hardware managed What data to cache? Where to store the cached data? What data to evict when the cache fills up? When to store data

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors

An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors ACM IEEE 37 th International Symposium on Computer Architecture Elastic Cooperative Caching: An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors Enric Herrero¹, José González²,

More information

Replacement policies for shared caches on symmetric multicores : a programmer-centric point of view

Replacement policies for shared caches on symmetric multicores : a programmer-centric point of view 1 Replacement policies for shared caches on symmetric multicores : a programmer-centric point of view Pierre Michaud INRIA HiPEAC 11, January 26, 2011 2 Outline Self-performance contract Proposition for

More information

Half full or half empty? William Gropp Mathematics and Computer Science

Half full or half empty? William Gropp Mathematics and Computer Science Half full or half empty? William Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp MPI on Multicore Processors Work of Darius Buntinas and Guillaume Mercier 340 ns MPI ping/pong latency More

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too

More information

EN105 : Computer architecture. Course overview J. CRENNE 2015/2016

EN105 : Computer architecture. Course overview J. CRENNE 2015/2016 EN105 : Computer architecture Course overview J. CRENNE 2015/2016 Schedule Cours Cours Cours Cours Cours Cours Cours Cours Cours Cours 2 CM 1 - Warmup CM 2 - Computer architecture CM 3 - CISC2RISC CM 4

More information

Non-uniform memory access (NUMA)

Non-uniform memory access (NUMA) Non-uniform memory access (NUMA) Memory access between processor core to main memory is not uniform. Memory resides in separate regions called NUMA domains. For highest performance, cores should only access

More information



