Continuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting
1 Continuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting
Animesh Jain, Michael A. Laurenzano, Lingjia Tang and Jason Mars
International Symposium on Microarchitecture (MICRO), 2016
October 18, 2016
7 Rampant Dynamism in Datacenters
Dynamism: dynamic factors that affect application runtime environments
- Microarchitectural flexibility
- Co-running of applications
- Platform diversity
Dynamism affects the runtime availability of resources.
19 Static Compiler Optimizations
- Compilation assumptions might not be met at runtime
- Resource-dependent static optimizations do not react to dynamism
Loop Tiling
- Restructures the memory access pattern to exploit data reuse
- Conceptualized before the multicore era, which presented little dynamism
[Figure: Static vs. Ideal tiling for four cases: Normal, Co-running application, Partitioned cache, Different architecture]
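The loop tiling transformation named on this slide can be illustrated with a small sketch. It is not code from the talk: the function names and the `tile` parameter are hypothetical, and matrix multiplication is chosen only as a familiar example. Pure Python will not show the cache benefit; the sketch only shows how the iteration order is restructured so that a `tile` x `tile` block of each matrix is reused before moving on.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: C = A * B for n x n lists of lists."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Same computation, visited tile by tile.

    `tile` is the (hypothetical) tile edge length. A static compiler
    fixes it at build time; the talk argues it should follow the
    cache capacity actually available at runtime.
    """
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # min() handles n not being a multiple of tile.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

Both functions return the same matrix for any positive `tile`; only the order in which elements are touched differs, which is what makes the best `tile` value depend on the cache the loop actually gets.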
24 Co-runner Tiling Comparison
[Figures: three panels, each comparing Static vs Dynamic]
Dynamism requires rethinking cache tiling.
32 Design Objectives
- Dynamic: should react to changes in the runtime environment
- High accuracy: should identify a high-performance tiling strategy
- Low overhead: should have low dynamic performance overhead
Current techniques are not enough
- White-box approaches
- Math kernel libraries like Intel MKL, ATLAS
[Table: rows White-box approach, BLAS libraries; columns Dynamic, Accuracy, Low-overhead]
Online generation of a black-box model
33 ShapeShifter
40 Key Components
- Dynamic tile generation: Companion thread (Protean Code + Polly)
- Detect tiling opportunities: Runtime Environment Monitor (REM)
- Find a high-performance tile: Tile Selector
[Figure: ShapeShifter. Application 1 and Application 2 each run a tiled loop from a code cache and are paired with Companion 1 and Companion 2; a companion controller hosts the dynamic compiler, REM and tile selector]
Protean Code, MICRO 2014 and Polly, PLDI 2008
46 Overview
- Online training: select tile size and generate training data
- Tile selection: generate black-box model and select suitable tile shape
- Monitored execution: detect tiling opportunities
[Figure: Dynamic compiler, Tile selector and REM. Online training: find tile size, collect cache stats, training set. Tile selection: tile performance model, choose tile. Monitored execution, with a runtime environment change leading back to training]
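The three phases on this slide can be read as a small state machine. The sketch below is one such reading, not the authors' implementation: the `Phase` enum, `next_phase`, and the exact transitions are assumptions inferred from the order in which the slide introduces the phases and the "Runtime environment change" label.

```python
from enum import Enum


class Phase(Enum):
    ONLINE_TRAINING = "online training"
    TILE_SELECTION = "tile selection"
    MONITORED_EXECUTION = "monitored execution"


def next_phase(phase, env_changed):
    """Return the phase that follows `phase`.

    Assumed transitions: training -> selection -> monitored execution;
    monitored execution goes back to training only when the runtime
    environment monitor reports a change (env_changed is True).
    """
    if phase is Phase.ONLINE_TRAINING:
        return Phase.TILE_SELECTION
    if phase is Phase.TILE_SELECTION:
        return Phase.MONITORED_EXECUTION
    if env_changed:
        return Phase.ONLINE_TRAINING
    return Phase.MONITORED_EXECUTION
```

Under this reading the application spends most of its time in monitored execution and pays the training cost only after an environment change.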
52 Tile Selection
- Black-box model is generated online
- Uses tile parameters and IPC from tile data
- Model is specific to the application and its current runtime environment
- Predicts a tile suitable to the current runtime environment
[Figure: training data (tile parameters, IPC) feeds the black-box model; the model gives IPC pred for the set of tile shapes of the predicted size; the shape with IPC max is T shapeshifter]
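The selection step on this slide (train on tile parameters and IPC, predict IPC for candidate shapes, keep the maximum) can be sketched as follows. The slides do not say which model family is used, so the inverse-distance-weighted nearest-neighbour predictor here is a stand-in chosen for brevity, and `predict_ipc`, `choose_tile` and `k` are hypothetical names.

```python
def predict_ipc(training, tile, k=3):
    """Estimate IPC for `tile` from the k nearest training samples.

    `training` is a list of (tile_params, measured_ipc) pairs, where
    tile_params is a tuple such as (ti, tj, tk). This stand-in model
    is an assumption; it is not taken from the talk.
    """
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    nearest = sorted(training, key=lambda s: dist(s[0], tile))[:k]
    # Closer samples get larger weights; the epsilon avoids dividing by zero.
    weights = [1.0 / (dist(p, tile) + 1e-9) for p, _ in nearest]
    total = sum(w * ipc for w, (_, ipc) in zip(weights, nearest))
    return total / sum(weights)


def choose_tile(training, candidates, k=3):
    """Return the candidate tile shape with the highest predicted IPC."""
    return max(candidates, key=lambda t: predict_ipc(training, t, k))
```

A caller would collect `training` during online training, enumerate `candidates` as the tile shapes of the chosen size, and hand the result of `choose_tile` to the dynamic compiler.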
56 Insight for Co-optimization
- Challenging to retile multiple applications simultaneously
- Tile shape and tile size contribute differently to cache interference
- Co-optimization: find the tile size for the apps, then the tile shape one by one
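One way to read "tile size first, then tile shape one by one" is as a two-pass greedy search. The sketch below is that reading only; the slide gives no algorithm, and `co_optimize`, `shapes_of` and `score` are hypothetical names, with `score` standing in for a measurement of the co-running applications.

```python
def co_optimize(apps, sizes, shapes_of, score):
    """Two-pass greedy search over per-application tiles.

    apps      -- list of application names
    sizes     -- candidate tile sizes
    shapes_of -- shapes_of(size) -> list of tile shapes with that size;
                 the first entry is treated as the default shape
    score     -- score(config) -> float, higher is better, where config
                 maps each app to its (size, shape) pair
    """
    # Start every application at the first size with its default shape.
    config = {app: (sizes[0], shapes_of(sizes[0])[0]) for app in apps}

    def best(app, options):
        # Try each option for one app while the others stay fixed.
        return max(options, key=lambda opt: score({**config, app: opt}))

    # Pass 1: settle a tile size for each application.
    for app in apps:
        config[app] = best(app, [(s, shapes_of(s)[0]) for s in sizes])

    # Pass 2: with sizes fixed, refine the shape one application at a time.
    for app in apps:
        size = config[app][0]
        config[app] = best(app, [(size, shape) for shape in shapes_of(size)])

    return config
```

Splitting the search this way keeps the number of evaluated configurations linear in the number of applications instead of exploring every combination of sizes and shapes at once.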
57 Experimental Evaluation
61 Methodology
- Polybench application suite
- Three sources of dynamism: co-running applications, microarchitectural flexibility (cache partitioning), platform diversity
- Three platforms: AMD Bulldozer, Intel Haswell, Intel Atom
- Tiling is performed in the shared cache
66 Co-runner
- Arrival/departure of a co-runner
- Static Best: best tile with no co-runner
- Co-runner change: syr2k to correlation
- Change in cache allocations
71 Microarchitectural Flexibility
- Microarchitectural flexibility: cache partitioning
- Static Best: best tile with no cache resizing (16-way enabled)
74 Platform Diversity
- Platform diversity: Intel Atom, Intel Haswell and AMD Bulldozer
- Static Best: best tile on AMD Bulldozer
79 Conclusions
- ShapeShifter: end-to-end dynamic loop co-optimization
- Adapts the tiling strategy to the application runtime environment
- Loop co-optimization: tiling multiple applications on the fly
- Novel black-box modelling approach: fast and accurate
ShapeShifter achieves significant performance improvements across different sources of dynamism.
80 Q/A
81 Why does the black-box model work? There is a trade-off between the best tiling strategy and performance; we show that ShapeShifter (SS) chooses a close one.
Why 3D tiling? We build on Polly, but the technique is not restricted to 3D tiling.
Also memorize the compilation times.
Two reasons for slowdown: the tile doesn't matter, or the black-box model is not good enough.
Remember cache sizes.
Prior work refresh.
82 Overhead
Companion thread
Three sources of overhead
- Dynamic compilation: 136 ms on Intel Haswell, 430 ms on AMD Bulldozer
- Code redirection
- Training
83 Overhead: training
84 Black-box model
- Multiple high-performance tiles
- ShapeShifter chooses one of the high-performance tiles
85 ShapeShifter vs Dynamic Oracle
ShapeShifter achieves 93% of the dynamic oracle performance on average.
86 Co-runner
Continuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting. Animesh Jain, Michael A. Laurenzano, Lingjia Tang and Jason Mars, University of Michigan, Ann Arbor.