Optimizing for Speed. What is the potential gain? What can go Wrong? A Simple Example. Erik Hagersten Uppsala University, Sweden


Optimizing for Speed
Erik Hagersten, Uppsala University, Sweden
eh@it.uu.se

What is the potential gain?
  Latency difference between L$ and memory: ~5x
  Bandwidth difference between L$ and memory: ~[?]x
  Repeated TLB misses add a factor of ~[?]-3x
  Executing from L$ instead of from memory ==> 5-5x improvement
  At least a factor of [?]-[?]x is within reach

Optimizing for cache performance
  Keep the active footprint small
  Use the entire cache line once it has been brought into the cache
  Fetch a cache line prior to its usage
  Let the CPU that already has the data in its cache do the job
  ...

What can go wrong? A simple example
  Perform a diagonal copy [?] times
  [Figure: an N x N array]

Example: Loop order

    //Optimized Example A
    for (i=1; i<N; i++) {
      for (j=1; j<N; j++) {
        A[i][j] = A[i-1][j-1];
      }
    }

    //Unoptimized Example A
    for (j=1; j<N; j++) {
      for (i=1; i<N; i++) {
        A[i][j] = A[i-1][j-1];
      }
    }

Performance Difference: Loop order
  [Figure: speedup vs. unoptimized as a function of array side, for Athlon64 x2, Pentium D, and Core 2 Duo]

Example: Sparse data

    //Optimized Example A
    for (i=1; i<N; i++) {
      for (j=1; j<N; j++) {
        A_data[i][j] = A_data[i-1][j-1];
      }
    }

    //Unoptimized Example A
    for (i=1; i<N; i++) {
      for (j=1; j<N; j++) {
        A[i][j].data = A[i-1][j-1].data;
      }
    }

  [Figure: dense versus sparse layout of the data fields (d) in memory]

Performance Difference: Sparse data
  [Figure: speedup vs. unoptimized as a function of array side, for Athlon64 x2, Pentium D, and Core 2 Duo]
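The loop-order example can be tried out with a small sketch like the one below. It assumes a square array stored in one flat, row-major buffer; the function names and the flat indexing are illustrative and do not come from the slides. Both functions compute the same result, and only the order in which memory is touched differs.

```c
#include <assert.h>
#include <stddef.h>

/* Diagonal copy with the inner loop walking along a row: consecutive
   iterations touch adjacent addresses, because C arrays are row-major. */
void diag_copy_row_order(size_t n, double *a)
{
    assert(a != NULL);
    for (size_t i = 1; i < n; i++)
        for (size_t j = 1; j < n; j++)
            a[i * n + j] = a[(i - 1) * n + (j - 1)];
}

/* Same computation with the loops interchanged: the inner loop now
   jumps n elements between consecutive accesses. */
void diag_copy_column_order(size_t n, double *a)
{
    assert(a != NULL);
    for (size_t j = 1; j < n; j++)
        for (size_t i = 1; i < n; i++)
            a[i * n + j] = a[(i - 1) * n + (j - 1)];
}
```

Timing the two functions for growing n on a given machine is one way to reproduce the shape of the speedup curve; the numbers will depend on the cache sizes of that machine.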

Loop Merging

    /* Unoptimized */
    /* the numeric constants were lost in the source; 2 is assumed */
    for (i = 0; i < N; i = i + 1)
      for (j = 0; j < N; j = j + 1)
        a[i][j] = 2 * b[i][j];
    for (i = 0; i < N; i = i + 1)
      for (j = 0; j < N; j = j + 1)
        c[i][j] = K * b[i][j] + d[i][j]/2;

    /* Optimized */
    for (i = 0; i < N; i = i + 1)
      for (j = 0; j < N; j = j + 1) {
        a[i][j] = 2 * b[i][j];
        c[i][j] = K * b[i][j] + d[i][j]/2;
      }

Padding of data structures
  [Figure: cache lookup (index, tag comparison, hit logic, data select); the addresses A, A+256*8 and A+256*2*8 map to the same cache index]

Padding of data structures
  Allocate more memory than needed
  [Figure: with padding, the addresses A, A+256*8+padding and A+256*2*8+2*padding map to different cache indices]

Blocking

    /* Unoptimized ARRAY: x = y * z */
    for (i = 0; i < N; i = i + 1)
      for (j = 0; j < N; j = j + 1)
        {r = 0;
         for (k = 0; k < N; k = k + 1)
           r = r + y[i][k] * z[k][j];
         x[i][j] = r;
        };

  [Figure: access patterns in the arrays X, Y and Z]
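The loop merging slide can be checked with a minimal sketch such as the following. The flat n*n buffers, the function names, and the constant 2.0 are assumptions made for illustration. The merged version reads each element of b once while it is still in the cache, instead of streaming b through the cache twice.

```c
#include <assert.h>
#include <stddef.h>

/* Two separate passes over b (unmerged). */
void two_passes(size_t n, double k, const double *b, const double *d,
                double *a, double *c)
{
    assert(a && b && c && d);
    for (size_t i = 0; i < n * n; i++)
        a[i] = 2.0 * b[i];
    for (size_t i = 0; i < n * n; i++)
        c[i] = k * b[i] + d[i] / 2.0;
}

/* One merged pass: b[i] is loaded once and used for both results. */
void merged_pass(size_t n, double k, const double *b, const double *d,
                 double *a, double *c)
{
    assert(a && b && c && d);
    for (size_t i = 0; i < n * n; i++) {
        a[i] = 2.0 * b[i];
        c[i] = k * b[i] + d[i] / 2.0;
    }
}
```

Merging is only valid when the second loop does not depend on values that the first loop produces in later iterations, which holds here because each element is computed independently.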

Blocking

    /* Optimized ARRAY: X = Y * Z */
    for (jj = 0; jj < N; jj = jj + B)
      for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
          for (j = jj; j < min(jj+B,N); j = j + 1)
            {r = 0;
             for (k = kk; k < min(kk+B,N); k = k + 1)
               r = r + y[i][k] * z[k][j];
             x[i][j] += r;
            };

  [Figure: arrays X (partial solution), Y and Z, with the first and second block marked]

Blocking: the Movie!

    /* Optimized ARRAY: X = Y * Z */
    for (jj = 0; jj < N; jj = jj + B)                  /* Loop 5 */
      for (kk = 0; kk < N; kk = kk + B)                /* Loop 4 */
        for (i = 0; i < N; i = i + 1)                  /* Loop 3 */
          for (j = jj; j < min(jj+B,N); j = j + 1)     /* Loop 2 */
            {r = 0;
             for (k = kk; k < min(kk+B,N); k = k + 1)  /* Loop 1 */
               r = r + y[i][k] * z[k][j];
             x[i][j] += r;
            };

  [Figure: the regions of X (partial solution), Y and Z swept by loops 1 to 5, with the first and second block marked]

Prefetching

    /* Unoptimized */
    /* the numeric constant was lost in the source; 2 is assumed */
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        x[i][j] = 2 * x[i][j];

    /* Optimized */
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        PREFETCH x[i+1][j];
        x[i][j] = 2 * x[i][j];
      }

  (Typically, the HW prefetcher will successfully prefetch sequential streams.)

Cache Affinity
  Schedule the process on the processor it last ran on
  Allocate and free data buffers in LIFO order
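The blocked matrix multiplication can be written out as a runnable sketch along the following lines. The flat row-major buffers, the function names, and the explicit zeroing of x are assumptions added for illustration; the loop nest itself follows the jj/kk/i/j/k structure of the blocking slides. The naive version is included so that the two can be compared.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Straightforward x = y * z for n x n matrices in row-major order. */
void matmul_naive(size_t n, const double *y, const double *z, double *x)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double r = 0.0;
            for (size_t k = 0; k < n; k++)
                r += y[i * n + k] * z[k * n + j];
            x[i * n + j] = r;
        }
}

/* Blocked version: works on bs x bs tiles of z so that a tile can be
   reused from the cache. x accumulates partial sums, so it must start
   at zero. */
void matmul_blocked(size_t n, size_t bs, const double *y, const double *z,
                    double *x)
{
    assert(bs > 0);
    for (size_t i = 0; i < n * n; i++)
        x[i] = 0.0;
    for (size_t jj = 0; jj < n; jj += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t i = 0; i < n; i++)
                for (size_t j = jj; j < min_size(jj + bs, n); j++) {
                    double r = 0.0;
                    for (size_t k = kk; k < min_size(kk + bs, n); k++)
                        r += y[i * n + k] * z[k * n + j];
                    x[i * n + j] += r;
                }
}
```

The block size bs would normally be chosen so that the tiles being reused fit in the cache level being targeted; a good value is machine dependent and is best found by measurement.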

Optimize for other caches
  TLB...
  Avoid random accesses to huge data structs (e.g., a huge hash table)
  Avoid few accesses per page (very sparse data)

Commercial Break: Acumem's Multicore Tools

Acumem SlowSpotter
  Source: C, C++, Fortran, OpenMP, ...

    /* Unoptimized Array Multiplication: x = y * z, N = [?] */
    for (i = 0; i < N; i = i + 1)
      for (j = 0; j < N; j = j + 1)
        {r = 0;
         for (k = 0; k < N; k = k + 1)
           r = r + y[i][k] * z[k][j];
         x[i][j] = r;
        }

  Mission:
    Find the SlowSpots
    Assess their importance
    Enable non-experts to fix them
    Improve the productivity of performance experts

  What? Where? How? Help!

  [Figure: tool flow. Source, any compiler, binary; a sampler on the host system produces a finger print (~[?] MB); analysis of the finger print together with the target system parameters produces advice]

A One-Click Report Generation
  Fill in the following fields:
    Application to run
    Input arguments
    Working dir (where to run the app)
    (Limit, if you like, the data gathered here, e.g., start gathering after [?] sec. and stop after [?] sec.)
    Cache size of the target system for optimization (e.g., L1 or L2 size)
  Click this button to create a report

  [Figure: miss rate, fetch rate, and predicted fetch rate (if utilization 100%) as a function of cache size; cache utilization is the fraction of cache data utilized]

Loop Focus Tab
  Spotting the crime
  List of bad loops
  Cache size to optimize for
  Explaining what to do

Bandwidth Focus Tab
  Spotting the crime
  List of Bandwidth SlowSpots
  Explaining what to do

Resource Sharing Example: Libquantum
  A quantum computer simulation
  Widely used in research (download from: http://www.libquantum.de/)
  [?]+ lines of C, fairly complex code
  Runs an experiment in ~3 min
  Throughput improvement:
  [Figure: relative throughput vs. number of cores used]

Utilization Analysis Libquantum
  [Figure: fetch rate and predicted fetch rate if utilization = 100%, as a function of cache size, for the original code; cache utilization (fraction of cache data utilized): [?].3%]
  [Figure: array of records, each holding a data field and a status field]
  Only accessing status data in main loop
  Need 3 MB per thread!

    /* Original code */
    for (i=0; i<MAX; i++) {
      ... = huge_data[i].status + ...
    }

    /* Utilization optimization */
    for (i=0; i<MAX; i++) {
      ... = huge_data_status[i] + ...
    }

SlowSpotter's First Advice: Improve Utilization
  Change one data structure
  Involves ~[?] lines of code
  Takes a non-expert 3 min
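The utilization advice amounts to moving the one field used by the hot loop out of a large record and into its own dense array. A minimal sketch of that change follows. The record layout (seven doubles plus an int, typically 64 bytes on common 64-bit ABIs) and all names are hypothetical and are not taken from the libquantum source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: the hot loop only needs `status`, but every
   access drags the surrounding `data` bytes into the cache as well. */
struct record {
    double data[7];
    int status;
};

/* Array-of-structs traversal: one useful int per record fetched. */
long sum_status_records(const struct record *v, size_t n)
{
    assert(v != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i].status;
    return sum;
}

/* Separate status array: every byte of a fetched cache line is useful. */
long sum_status_dense(const int *status, size_t n)
{
    assert(status != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += status[i];
    return sum;
}
```

With a record of about 64 bytes and a 4-byte status field, the first loop uses roughly one sixteenth of the bytes it fetches, which is the kind of low cache utilization the report on this slide points at.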

After Utilization Optimization: Libquantum
  [Figure: old fetch rate (original code), predicted fetch rate, and new fetch rate (utilization optimization) as a function of cache size; cache utilization 95%]

  Two positive effects from better utilization:
    1. Each fetch brings in more useful data, giving a lower fetch rate
    2. The same amount of useful data can fit in a smaller cache, shifting the curve left

Reuse Analysis Libquantum
  SPEC CPU2006 - 462.libquantum
  [Figure: fetch rate as a function of cache size for the utilization optimization and for the utilization + fusion optimization]

    /* Utilization optimization */
    ...
    toffoli(huge_data, ...)
    cnot(huge_data, ...)
    ...

    /* Utilization + fusion optimization */
    ...
    fused_toffoli_cnot(huge_data, ...)
    ...

  Second-Fifth SlowSpotter Advice: Improve reuse of data
    Fuse functions traversing the same data
    Here: four fused functions created
    Takes a non-expert < [?] h

Effect: Reuse Optimization
  [Figure: old fetch rate (utilization optimization) and new fetch rate (utilization + fusion optimization) as a function of cache size]
  The miss in the second loop goes away
  Still need the same amount of cache to fit all data
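Fusing two functions that traverse the same data can be sketched as below. The gate-like operation (flip the target bits of every element whose control bits are all set) and all names are hypothetical stand-ins chosen for illustration; this is not the libquantum implementation of toffoli or cnot. The point is only that two full sweeps over the array become one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One sweep: flip the target bits of every element whose control bits
   are all set. */
void flip_if(uint64_t *state, size_t n, uint64_t control, uint64_t target)
{
    assert(state != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if ((state[i] & control) == control)
            state[i] ^= target;
}

/* Fused sweep: applies two such operations back to back on each
   element, so the array is brought through the cache once, not twice. */
void fused_flip_if(uint64_t *state, size_t n,
                   uint64_t control1, uint64_t target1,
                   uint64_t control2, uint64_t target2)
{
    assert(state != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t s = state[i];
        if ((s & control1) == control1)
            s ^= target1;
        if ((s & control2) == control2)
            s ^= target2;
        state[i] = s;
    }
}
```

The fusion is safe here because each element is updated independently of all others; operations that read neighbouring elements would need more care.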

Utilization + Reuse Optimization: Libquantum
  [Figure: old fetch rate (utilization optimization) and new fetch rate (utilization + fusion optimization) as a function of cache size]
  Fetch rate down to [?].3% for [?] MB
  Same as a 3 MB cache originally

Summary Libquantum
  [Figure: throughput vs. # cores used for Original, Utilization Optimization, and Utilization + Fusion; [?].7x throughput]

Demo Time!
  Libquantum: original code, spatial optimization, spatial optimization + loop fusion
  Performance
  Edit-compile-analysis cycle: [?] min

Original Cigar Throughput
  [Figure: throughput vs. # cores]

Throughput scalability is a different way to look at the performance of an application. Here, several single-threaded instances of the application are run at the same time. Even though the different instances do not explicitly depend on each other, they will nevertheless fight over the shared resources; e.g., running four threads on four cores implies that each thread will get one quarter of the shared cache. A system using four cores to run four instances of Cigar will actually result in a lower throughput than if only three cores were used.

Throughput Performance: Intel Core 2 (Intel Xeon E535)
  [Figure: throughput vs. # cores, original vs. optimized; 33x]
  The optimization puts a much lower pressure on the shared cache, resulting in a 33x better throughput for four cores.

Throughput Performance (AMD's Istanbul)
  [Figure: throughput vs. # cores, original vs. optimized; 7x]
  AMD's new six-core Istanbul processor can enjoy a 7x better throughput due to the optimization on six cores.

Throughput Performance (Intel i7)
  [Figure: normalized throughput vs. # threads, original vs. optimized; 3x]
  Intel's new four-core i7 (Nehalem) processor enjoys a 3x better throughput due to the optimization on four cores. Note that each core can run up to two threads.

Cache sharing issues

Fighting for shared resources
  [Figure: binaries running on cores that share a cache and memory; cache misses; "$ wasted"]

1st Order MC Performance Problems
  Additional multicore issues:
    Even less cache resources per application
    Sharing of cache resources
    Wasted cache usage

Example: Hints to avoid cache pollution (non-temporal prefetches)
  The larger the cache, the better
  [Figure: miss rate vs. cache size for one instance and four instances; labels: actual cache size, actual/[?]]
  Hint: Don't allocate!
  [Figure: throughput for Orig, Original Lim=[?].7MB, and Hint: lim=actual/[?]; throughput [?]% faster]

Example: Hints for mixed workloads (non-temporal prefetches)
  [Figure: miss rate vs. cache size for Libquantum, LBM and bzip2; curve annotations: streaming, bigger is better, tiny; the actual cache size is marked]
  [Figure: performance of bzip2, Libquantum, LBM and their geometric mean when run individually, in mix, and in mix, patched; 5%; AMD Opteron]

Some performance tools
  Free licenses:
    Oprofile
    GNU: gprof
    AMD: code analyst
    Google performance tools
    Virtual Inst: High Productivity Supercomputing (http://www.vi-hps.org/tools/)
  Not free:
    Intel: Vtune and many more
    Sun Studio
    Allinea, TotalView, ... (for MPI)
    Acumem (of course)
    HP: Multicore toolkit (some free, some not)