
High Performance Computing and Programming, Lecture 3: Memory usage and some other things. Ali Dorostkar, Division of Scientific Computing, Department of Information Technology, Uppsala University, Sweden, 2015

About Me Ali Dorostkar, PhD at the Division of Scientific Computing. Room: 2350 ali.dorostkar@it.uu.se Things I am working on: Numerical solution methods High performance computing Glacial isostatic adjustment

Outline Things covered in this lecture: Memory usage malloc and free Cache, data locality, blocking techniques Function side effects, pure functions The const keyword The restrict keyword

Please ask questions Please interrupt whenever there are questions!

Reading Fog 7.1: Different kinds of variable storage. Fog 8.3: Obstacles to optimization by the compiler. Fog 9: Optimizing memory access. Oliveira 8: Data structures. Oliveira 12: Computer architecture and efficiency.

Simplified model of a computer: CPU, memory, hard disk. Work done inside the CPU is fast. Access to memory is slow. Access to the hard disk is even slower.

Parts of a computer

Details inside the CPU Cache memory is slower than registers.

Example of memory hierarchy

C language works with memory addresses The C language syntax lets us work with pointers to memory addresses. However, a C program typically does not explicitly decide what is stored in cache and in registers. The register keyword hints to the compiler that very fast memory should be used if possible, e.g. register int a; It usually doesn't make much difference: compilers don't care any more. The program only deals with memory addresses (pointers).

Memory allocation and deallocation Types: Heap: long-term memory, large allocations. Stack: first-in, last-out (FILO) storage, short-term memory.

The stack, pros and cons Memory is managed automatically (no need to free). Memory stays valid as long as it is on the stack and is destroyed when it is popped off. Very fast access. The stack is relatively small, so don't do anything that eats up lots of stack space: allocating large arrays, structures, and classes, or functions with heavy recursion.

Memory allocation and deallocation Heap memory, how does it work? The program requests a memory block of some size by calling malloc(...). The OS delivers a pointer to such memory. The program uses the memory. The program returns control of the memory block to the OS by calling free(...).
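A minimal sketch of this workflow (the array size and fill values here are just for illustration):

  #include <stdio.h>
  #include <stdlib.h>

  int main(void) {
    int n = 1000;
    /* Request a block of n doubles from the heap */
    double *buf = malloc(n * sizeof(double));
    if (buf == NULL)
      return 1;               /* allocation can fail: always check */
    for (int i = 0; i < n; i++)
      buf[i] = 2.0 * i;       /* use the memory */
    printf("%f\n", buf[n - 1]);
    free(buf);                /* return the block when done */
    return 0;
  }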

Problems with heap allocation The OS needs to take care of many things: keep track of all allocations made, and the size of each; find a vacant portion of memory when malloc(...) is called; when free(...) is called, find and modify/remove the saved info about that memory block. Risk of heap fragmentation. Alignment issues. Lots of work that can take time.

Memory allocation optimizations Avoid too many small allocations: avoid having many malloc/free calls for small memory buffers. Ways to do this: allocate a few larger blocks and get smaller blocks by pointing into the large ones; use a pre-allocated work buffer instead of doing malloc/free each time a function is called; put small things on the stack instead of calling malloc; use your own allocation system for many small blocks, which can be efficient e.g. if all needed blocks have the same size. A sketch of the work-buffer idea follows below.
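A minimal sketch of the pre-allocated work buffer idea: the caller owns the scratch memory, so the inner function does no malloc/free of its own (function and buffer names are illustrative, not from the lecture):

  #include <stdlib.h>

  /* Called many times; uses caller-provided scratch space instead of
     doing its own malloc/free on every call. */
  void smooth(const double *in, double *out, int n, double *work) {
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (int i = 1; i < n - 1; i++)
      work[i] = 0.5 * (in[i - 1] + in[i + 1]);
    for (int i = 1; i < n - 1; i++)
      out[i] = 0.5 * (in[i] + work[i]);
  }

  void driver(const double *in, double *out, int n, int iterations) {
    double *work = malloc(n * sizeof(double));  /* one allocation up front */
    for (int it = 0; it < iterations; it++)
      smooth(in, out, n, work);
    free(work);                                 /* one deallocation at the end */
  }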

Cache and data locality Programs often re-use nearby memory locations: Temporal locality: the same address is likely to be needed again soon. Spatial locality: nearby addresses are likely to be needed soon. This is why caches are built into the hardware.

Caches Several levels of cache with varying latency: L1 (approx. 10-100 kB), 1-5 cycles; L2 (256 kB and up), 10-20 cycles; L3 (> 2 MB), slower. Caches are organized into cache lines, approx. 128 bytes each. The L1 cache comes in two types: L1d (data) and L1i (instruction) cache.

Example of costs for cache/memory access

  To where       Cycles
  Register       1
  L1d            3
  L2             14
  Main memory    240

(Numbers listed by Intel for a Pentium M)

Cache optimizations Improve data locality! Example: blocking (see the sketch below). Optimizations can have different effects depending on e.g. the prefetching capabilities of the hardware.
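A minimal sketch of blocking (loop tiling) for the matrix-matrix multiplication used later in this lecture; the block size BS is a tuning parameter chosen here for illustration:

  #define BS 64   /* block size; tune to the L1/L2 cache size */

  /* c += a * b for n x n matrices. Each BS-wide strip of b and c is
     reused while it is still in cache. */
  void matmul_blocked(int n, double a[n][n], double b[n][n], double c[n][n]) {
    for (int kk = 0; kk < n; kk += BS)
      for (int jj = 0; jj < n; jj += BS)
        for (int i = 0; i < n; i++)
          for (int k = kk; k < kk + BS && k < n; k++)
            for (int j = jj; j < jj + BS && j < n; j++)
              c[i][j] += a[i][k] * b[k][j];
  }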

How do caches work? Direct-mapped cache: a hash function maps every unique address to a single place. Set-associative cache: the cache space is divided into n sets and an address is distributed modulo n; k-way set associative (k = 2...24). Fully associative cache: an address maps to any slot. In practice, k-way set-associative caches are commonly used, e.g. the L1d cache is commonly 2-way set associative. A small sketch of the set mapping follows below.
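As a rough illustration of the set-associative mapping (the line size and set count here are example values, not those of a specific CPU):

  #include <stdint.h>
  #include <stdio.h>

  enum { LINE_SIZE = 64, NUM_SETS = 64 };   /* example cache geometry */

  /* Which set a given address maps to: drop the offset within the line,
     then take the line number modulo the number of sets. */
  unsigned set_index(uintptr_t addr) {
    return (addr / LINE_SIZE) % NUM_SETS;
  }

  int main(void) {
    double a[1024];
    /* Addresses LINE_SIZE*NUM_SETS bytes apart map to the same set and
       compete for its k ways. &a[512] is 4096 bytes after &a[0]. */
    printf("%u %u\n", set_index((uintptr_t)&a[0]),
                      set_index((uintptr_t)&a[512]));
    return 0;
  }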

General caching concepts Cache hit. Cache miss: Compulsory: on the first access to a block (cold-start misses, first-reference misses). Capacity: when blocks are discarded from the cache because the cache cannot contain all blocks. Conflict: in set-associative or direct-mapped strategies, conflict misses occur when several blocks are mapped to the same set or block frame; also called collision misses or interference misses. Hit rate and miss rate.

Cache optimization how? Try to increase the chances that the needed memory locations are already in cache: Nearby memory accesses in time should be to nearby locations in memory

Caches and performance Caches are extremely important for performance Level 1 latency is usually 1 or 2 cycles Work well for problems with nice locality properties Caching can be used in other areas as well, example: web-caching (proxies) Modern CPUs have two or three levels of cache Most of the chip area is used for caches

How caches work: cache lines The cache is divided into cache lines. When memory is read into the cache, a whole cache line is always read at the same time. This is good if we have data locality: nearby memory accesses will be fast. Typical cache line size: 64-128 bytes. The sketch below contrasts unit-stride access with strided access.
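To make the cache-line effect concrete, a small sketch contrasting unit-stride and large-stride traversal of the same array (array size and stride chosen for illustration):

  #define N (1 << 20)

  double a[N];

  /* Unit stride: every element of each loaded cache line is used. */
  double sum_unit_stride(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
      s += a[i];
    return s;
  }

  /* Stride of 16 doubles (128 bytes): each access touches a new cache
     line, so most of every loaded line is wasted. */
  double sum_large_stride(void) {
    double s = 0.0;
    for (int start = 0; start < 16; start++)
      for (int i = start; i < N; i += 16)
        s += a[i];
    return s;
  }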

Cache optimization example Loop order for matrix-matrix multiplication

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        c[i][j] += a[i][k] * b[k][j];

  for (k = 0; k < n; k++)
    for (j = 0; j < n; j++)
      for (i = 0; i < n; i++)
        c[i][j] += a[i][k] * b[k][j];

  for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
        c[i][j] += a[i][k] * b[k][j];

Test problem, matrices of size 1000. Times are in milliseconds.

  Compiler    i loop     j loop     k loop    k loop restrict
  Clang       7979562    7936971    658910    666147
  GCC 4.9     7562046    7476514    696239    694889
  Intel       502837     482113     477861    481042

Memory is one-dimensional Any 2D, 3D, etc. array must somehow be mapped to the 1D memory address space. Example: matrix A (2D array, N x N), matrix element A_ij can be accessed as A[i*N+j] (or as A[j*N+i], depending on the convention used). Alternative: use an array of pointers, with A_ij accessed as A[i][j]. Both layouts are sketched below.
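A minimal sketch of both layouts (row-major convention assumed for the flat array):

  #include <stdlib.h>

  int main(void) {
    int N = 100;

    /* Flat 1D storage: element (i,j) lives at A[i*N + j] (row-major). */
    double *A = malloc(N * N * sizeof(double));
    A[3 * N + 7] = 1.0;   /* A_{3,7} */

    /* Array of pointers: one malloc per row, element (i,j) is B[i][j].
       Rows are separate allocations, so they are not contiguous. */
    double **B = malloc(N * sizeof(double *));
    for (int i = 0; i < N; i++)
      B[i] = malloc(N * sizeof(double));
    B[3][7] = 1.0;

    for (int i = 0; i < N; i++)
      free(B[i]);
    free(B);
    free(A);
    return 0;
  }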

Data structures Array of structs vs multiple arrays

  typedef struct {
    double x;
    double y;
  } structtype;

  structtype list[1000];
  for (k = 0; k < 1000; k++) {
    list[k].x = 0;
    list[k].y = 0;
  }

  double x_list[1000];
  double y_list[1000];
  for (k = 0; k < 1000; k++) {
    x_list[k] = 0;
    y_list[k] = 0;
  }

Function side effects are bad When a function modifies something other than its output value or output arguments, the function has side effects. Examples: modification of global variables or writing to files. Function side effects are bad for flexibility and bad for performance.

Function side effects are bad What to do with global variables: when you feel you need global variables, don't use them; avoid them as much as possible. Ideas: pass as an input argument; gather them in a structure; use a singleton (C++, needs OOP). The first two ideas are sketched below.
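A minimal sketch of the "pass as argument" and "gather in a structure" ideas (the params struct and its fields are illustrative, not from the lecture):

  /* Instead of global variables, gather the shared state in a struct... */
  typedef struct {
    double scale;
    double shift;
  } params;

  /* ...and pass it explicitly instead of reading globals inside the function. */
  void transform(double *v, int n, const params *p) {
    for (int i = 0; i < n; i++)
      v[i] = p->scale * v[i] + p->shift;
  }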

More on function side effects What does the compiler know?

  int func1(int x) { return f(x) + f(x) + f(x) + f(x); }

  int func2(int x) { return 4 * f(x); }

Are these two equivalent? Do you know? Does the compiler know?

Side effects in loops What does the compiler know?

  char str[] = "Hello World!";
  int i;
  for (i = 0; i < strlen(str); i++) {
    if (str[i] == '!')
      str[i] = '?';
  }

What is the complexity? Can you improve it? Can the compiler improve it? Does the compiler know the implementation of strlen? (An improved version is sketched below.)
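If strlen has to be re-evaluated on every iteration, the loop is quadratic in the string length. One improvement is to hoist the length computation out of the loop; a minimal sketch:

  #include <string.h>

  void replace_bangs(char *str) {
    size_t len = strlen(str);      /* computed once, not once per iteration */
    for (size_t i = 0; i < len; i++) {
      if (str[i] == '!')
        str[i] = '?';
    }
  }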

Remedy Inline functions: less overhead, generally faster than a function call. The best candidates are small functions that are called frequently from a few places, or functions called with one or more compile-time constant parameters. Compiler attributes, e.g. __attribute__. const functions.

Pure functions Can use the gcc __attribute__((pure)) in a function declaration to tell the compiler that a function is pure (free from side effects):

  int f(int k);
  int f(int k) __attribute__((pure));

__attribute__((const)) also exists but is stricter; the function is then not allowed to examine data through pointer arguments. See the gcc documentation for details.
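As a sketch of why this matters (the compiler is not obliged to do this, but the attribute makes it legal):

  /* Declared pure: the result depends only on the argument and on memory
     the function reads, with no side effects. f is assumed to be defined
     in another translation unit, so the compiler cannot see its body. */
  int f(int k) __attribute__((pure));

  int four_calls(int x) {
    /* With the pure attribute the compiler may evaluate f(x) once and
       reuse the result, as in 4 * f(x); without it, it must assume each
       call could have side effects and keep all four calls. */
    return f(x) + f(x) + f(x) + f(x);
  }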

The const keyword Letting the compiler know more Use the const keyword to tell the compiler that the value will never change. Allows compiler to optimize more because the compiler knows more. Using const also helps avoid programming errors, like modifying something by mistake. Using const is both good for program design, and good for performance!

The const keyword Example

  int M = 50;        /* not declared const: compiler must assume the value may change */
  const int M = 50;  /* declared const: compiler knows the value will not change */

The compiler may be able to optimize more thanks to this info.

Pointer aliasing Any pointer of unknown origin can reference a value that is also accessed through another variable. Any pointer might be used as an array of unknown size. Multiple aliases for the same memory location make compile-time optimization very hard; we need ways to inform the compiler that aliasing does not occur.

The strict aliasing rule The default mode in C99 and recent GCC: pointers of different types are not allowed to refer to the same memory. Significant compilation benefits. A small illustration follows below.
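A small illustration of what the rule forbids (hypothetical snippet, not from the slides): accessing a float through an int pointer is exactly the kind of aliasing the compiler may now assume away.

  /* Violates strict aliasing: writes to *ip may be assumed not to affect
     *fp, because int and float are different types. Safe alternatives are
     memcpy or a union. */
  float bad_punning(float *fp) {
    int *ip = (int *)fp;   /* undefined behaviour under strict aliasing */
    *ip = 0;
    return *fp;            /* compiler may return the old value of *fp */
  }

  /* Allowed: the compiler must assume a and b can point to the same
     doubles, since they have the same type, unless restrict says otherwise. */
  void scale(double *a, const double *b, int n) {
    for (int i = 0; i < n; i++)
      a[i] = 2.0 * b[i];
  }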

The restrict keyword Available in many C/C++ compilers, including recent gcc (sometimes spelled __restrict). Within the scope of a restrict-qualified pointer, any memory location accessed through it will only be accessed through that pointer. Example:

  void copystring(char *restrict dest, char *restrict from) { ... }

Not all compilers are created equal: depending on the compiler, the keyword may be ignored, used, or give wrong results (as in task 3 in the lab).

  Compiler    Standard    Optimized
  Clang       0.380       0.379
  GCC 4.9     0.133       0.124
  Intel       0.097       0.095

The restrict keyword

  void f(double *v, int N, double *x) {
    int i;
    for (i = 0; i < N; i++)
      v[i] += (A + B*x[0] + C*x[1]) * D;
  }

Performance problem due to pointer aliasing?

The restrict keyword

  void f(double *v, int N, double *x) {
    int i;
    for (i = 0; i < N; i++)
      v[i] += (A + B*x[0] + C*x[1]) * D;
  }

Optimize by adding the restrict keyword:

  void f(double *restrict v, int N, double *restrict x) {
    int i;
    for (i = 0; i < N; i++)
      v[i] += (A + B*x[0] + C*x[1]) * D;
  }

No change in the function body, only restrict added in the argument list.

That's all. Questions?