Advanced microprocessor optimization. Kampala, August 2007. Agner Fog, www.agner.org

Agenda: Intel and AMD microprocessors; out-of-order execution; branch prediction; platform, 32 or 64 bits; choice of compiler and function libraries; finding the bottlenecks, profiling; cache and memory allocation; floating point, exceptions; parallelization: threads, vector instructions; discussion

Intel Core2 microarchitecture

AMD microarchitecture

Out-Of-Order Execution x = a / b; y = c * d; z = x + y; The multiplication y = c * d does not depend on x, so the CPU can execute it while the slow division is still in progress; only z = x + y has to wait for both results.

Register renaming
R1 = mem1 (cached)
R2 = mem2 (not cached)
R3 = mem3 (cached)
R2 = R2 + R1
R1 = R1 + R3
mem4 = R2 (= mem1 + mem2)
mem5 = R1 (= mem1 + mem3)
The new value written to R1 gets its own physical register, so the chain mem5 = mem1 + mem3 can complete while the slow, uncached load of mem2 is still pending. 8 logical registers, 96 physical registers.

Branch prediction (diagram: a loop branch, and a two-way branch selecting between paths A and B)

Choice of platform: Windows, Linux or Mac; which microprocessor; 32 bit or 64 bit; graphics coprocessor

Pros and Cons of 64 bit. Pros: the number of registers is doubled; function parameters are transferred in registers; more efficient memory allocation. Cons: pointers and the stack take more space; some instructions take a little more space

Choice of programming language, ordered from shortest development time to highest performance: wizards etc.; Java, C#, VB; C++; C; C++ with low-level language; assembly

Choice of compiler: Microsoft, Intel, Gnu, Borland

CPU dispatching (decision tree): if the CPU vendor string is not GenuineIntel or SSE is not supported, run the generic x86 code; otherwise check for SSE2 and SSE3 and branch to the SSE, SSE2 or SSE3 code path accordingly. A sketch of such a dispatcher follows below.
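
Below is a minimal sketch of such a dispatcher in C++, assuming a GCC or Clang compiler: __builtin_cpu_supports is a GCC/Clang builtin, and the compute_* functions are hypothetical placeholders for the differently compiled code paths.

// Hypothetical code paths, each compiled for a different instruction set.
void compute_generic(float* p, int n) { /* plain x86 code */ }
void compute_sse2(float* p, int n)    { /* SSE2 code */ }
void compute_sse3(float* p, int n)    { /* SSE3 code */ }

// Function pointer chosen once at start-up; all later calls go through it.
void (*compute)(float*, int) = compute_generic;

void init_dispatcher() {
    if (__builtin_cpu_supports("sse3"))      compute = compute_sse3;
    else if (__builtin_cpu_supports("sse2")) compute = compute_sse2;
    // otherwise keep the generic version
}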

Typical bottlenecks: start-up; databases; network; file input/output; RAM access, cache utilization; algorithm; dependency chains; CPU pipeline; CPU execution units (listed roughly in order of increasing speed)

Finding the bottlenecks. Profilers: Microsoft, Intel VTune (CPU specific), AMD CodeAnalyst (CPU specific). Your own instrumentation: insert time measurements in the code (single thread), as sketched below. Test with a realistic data set.
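
As a sketch of such instrumentation, the fragment below times one section of code with std::chrono (C++11); work_under_test is a hypothetical stand-in for the code being measured, and clock() or platform timers serve the same purpose in older code.

#include <chrono>
#include <cstdio>

// Dummy workload standing in for the code section being measured.
void work_under_test() {
    volatile double s = 0;
    for (int i = 0; i < 10000000; ++i) s += i * 0.5;
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    work_under_test();
    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("work_under_test took %.3f ms\n", ms);
}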

File input/output: limit the use of network resources; limit the number of program files, DLLs, configuration files, resource files, etc.; compress large data files; binary files are more compact than ASCII; read and write sequentially, in large data blocks

Static vs. dynamic linking. Static linking (*.lib, *.a): only the necessary functions are copied into the executable file. Dynamic linking (*.dll, *.so): the whole function library is loaded. Lazy binding: the function address is inserted the first time the function is called

Problems with dynamic linking: functions are distributed in a separate file; the whole function library is loaded into RAM even when only a single function is needed; RAM gets fragmented by many DLLs; round memory addresses compete for the same cache lines; function calls go via pointers in the import table

Set-associative cache. Core2 level-1 data cache: 64 sets * 8 ways * 64 bytes = 32 KB. AMD level-1 data cache: 512 sets * 2 ways * 64 bytes = 64 KB. Level 2: 16 ways, 0.5–4 MB

Memory allocation: prefer one large block to many small ones; avoid linked lists; weigh STL containers against your own container classes; array strides that are large powers of 2 cause cache contention
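
A small sketch of the difference, assuming standard C++ containers: a std::list makes one allocation per node and scatters the nodes in memory, while a std::vector with reserve() makes a single large allocation and keeps the elements contiguous.

#include <list>
#include <vector>

int main() {
    const int n = 1000000;

    // Many small allocations; nodes end up scattered in memory.
    std::list<int> scattered;
    for (int i = 0; i < n; ++i) scattered.push_back(i);

    // One large contiguous block, allocated up front.
    std::vector<int> contiguous;
    contiguous.reserve(n);                      // single allocation
    for (int i = 0; i < n; ++i) contiguous.push_back(i);

    // Sequential traversal of the vector touches consecutive cache lines;
    // traversing the list may miss the cache on every node.
    long long sum = 0;
    for (int v : contiguous) sum += v;
    return static_cast<int>(sum & 1);
}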

What is used together should be saved together. (Class diagram: Parent class with Attribute1, Attribute2, Method1(), Method2(); Child class with Attribute3, Attribute4, Method3(), Method4())

What is used together should be saved together (2)
Four separate arrays:
int a[1000], b[1000], c[1000], d[1000];
for (i = 0; i < 1000; i++) a[i] = b[i] * c[i] + d[i];
One array of structs, so the four values used in each iteration share a cache line:
struct abcd {int a; int b; int c; int d;};
abcd LL[1000];
for (i = 0; i < 1000; i++) LL[i].a = LL[i].b * LL[i].c + LL[i].d;

Dependency chains: x = a + b + c + d; is evaluated as ((a + b) + c) + d, one long chain; writing x = (a + b) + (c + d); lets the two inner additions execute in parallel. Most important with floating point

Loop-carried dependency chain
for (i = 0; i < n; i++) { sum += x[i]; }
Unrolling by two splits the chain into two independent partial sums (assuming n is even):
for (i = 0; i < n; i += 2) { sum1 += x[i]; sum2 += x[i+1]; }
sum = sum1 + sum2;

What can the compiler do for you? Constant propagation: a = 1.; b += a + 2. / 3; Becomes: b += 1.66666666666667;

What can the compiler do for you? (2) Induction variable: for (i=0; i<n; i++) arr[i] = 10+3*i; Becomes: for (i=0, t=10; i<n; i++, t+=3) arr[i] = t;

What can the compiler do and what can it not do? (3) Common subexpression: x = a + b + 5 + c; y = (c + b + a) * (b + a + c); Help compiler by writing: x = (a + b + c) + 5; y = (a + b + c) * (a + b + c);

What can the compiler do and what can it not do? (4) Loop-invariant expression: for (i=0; i<n; i++) arr[i] /= a + b; The compiler will compute a+b outside the loop, but it will not replace the division by a multiplication with 1./(a+b), because that would change the rounding of the floating point result
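
If the small rounding difference is acceptable, the reciprocal can be hoisted by hand; a minimal sketch:

// Compute the reciprocal once, outside the loop, and multiply inside it.
void scale(double* arr, int n, double a, double b) {
    double r = 1.0 / (a + b);     // one division instead of n divisions
    for (int i = 0; i < n; ++i)
        arr[i] *= r;
}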

Floating point: Don't mix single and double precision. float a, b; ... a = b + 0.1; The constant 0.1 has double precision and forces a conversion of b to double and of the result back to float; write a = b + 0.1f; instead

Exceptions. Exceptions are expensive, even when they don't occur. Options for overflow and NaN: prevent them outside the loop; catch the exception; propagate to the end result; or let the program stop. Underflow: flush to zero (SSE2)
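
A sketch of enabling flush-to-zero (and denormals-are-zero) with the SSE intrinsic headers; this trades a slight change in numerical results near underflow for avoiding the slow handling of denormal numbers.

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (SSE)
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3 header)

void enable_flush_to_zero() {
    // Results that underflow are replaced by zero instead of denormals.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    // Denormal inputs are treated as zero as well.
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}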

Virtual functions
class C0 { virtual void f(); };
class C1 : public C0 { virtual void f(); };
C0 * p;
p -> f();   // indirect call through the virtual table; cannot be inlined when the type of *p is unknown at compile time

Parallelization methods: parallel threads on multiple cores; the microprocessor can execute up to four instructions simultaneously in each core; vector instructions: 2 double precision or 4 single precision floating point operands at a time

Parallel threads: start threads explicitly; OpenMP directives; compiler-generated parallelization; coarse-grained vs. fine-grained; shared L2 or L3 cache
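
As a minimal OpenMP sketch (compiled with the compiler's OpenMP switch), a loop whose iterations are independent can be split across the cores with a single directive:

// Each iteration writes a different a[i], so the loop parallelizes safely.
void add_arrays(const double* b, const double* c, double* a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}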

Vector instructions

Vector types: a 128-bit register can be divided into 16 * 8-bit integers, 8 * 16-bit integers, 4 * 32-bit integers, 2 * 64-bit integers, 4 * 32-bit floats, or 2 * 64-bit doubles

Mathematical functions exp, log, sin, cos, etc.: Standard math libraries SSE2 enabled libraries SSE2 vector libraries

Coding of vector instructions, ordered from highest performance to shortest development time: assembler; inline assembly; intrinsic functions; vector classes; automatic generation by the compiler

Separate assembly. Advantages: everything is possible. Disadvantages: need to know all calling and register conventions; bugs are easy to make and difficult to find; long and complicated code; platform dependent; about 900 instructions to learn

Inline assembly. Advantages: the compiler takes care of all conventions; easier cross-platform porting. Disadvantages: limited possibilities; no error checking; about 900 instructions to learn

Intrinsic functions. Advantages: the compiler takes care of conventions; the compiler takes care of register allocation; the compiler can optimize further; easy porting to all x86 platforms. Disadvantages: hundreds of functions with long names; code becomes unreadable

Vector classes. Advantages: same advantages as intrinsic functions; clear code with well-known operators; use predefined classes or write your own. Disadvantages: data must be defined as 128-bit vectors; need to develop a class library; limited number of operators

Vector class example
Scalar version:
double a[2], b[2], c[2];
for (i=0; i<2; i++) a[i] = b[i] + c[i] * 5.;
With the F64vec2 vector class:
F64vec2 a, b, c;
a = b + c * F64vec2(5.);
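
For comparison, the same computation written directly with SSE2 intrinsics from <emmintrin.h> (a sketch; the unaligned load/store variants are used so no alignment assumptions are needed):

#include <emmintrin.h>   // SSE2 intrinsics

void example(const double b[2], const double c[2], double a[2]) {
    __m128d vb = _mm_loadu_pd(b);                      // load b[0], b[1]
    __m128d vc = _mm_loadu_pd(c);                      // load c[0], c[1]
    __m128d v5 = _mm_set1_pd(5.0);                     // broadcast 5.0
    __m128d va = _mm_add_pd(vb, _mm_mul_pd(vc, v5));   // a = b + c * 5
    _mm_storeu_pd(a, va);
}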

Automatic vectorization. Advantages: the compiler does all the work. Disadvantages: it can only optimize easily recognizable structures; the compiler doesn't know whether the data size is divisible by the vector size; the compiler doesn't know which loops run many or few times; the programmer must insert #pragmas to tell which pointers are aligned, etc.
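
A sketch of a loop written so that auto-vectorization is easy for the compiler: a simple counted loop with unit stride, no function calls, and __restrict (a common compiler extension in MSVC, GCC and Clang) to promise that the arrays do not overlap.

void triad(float* __restrict a, const float* __restrict b,
           const float* __restrict c, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + 2.0f * c[i];
}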

The future More processor cores 64 bit Application specific coprocessors SSE4 instruction set Programmable logic in CPU?

Optimize only the narrowest bottleneck: start-up; databases; network; file input/output; RAM access, cache utilization; algorithm; dependency chains; CPU pipeline; CPU execution units (listed roughly in order of increasing speed)