High-Performance Parallel Computing

High-Performance Parallel Computing
P. (Saday) Sadayappan and Rupesh Nasre

Course Overview
- Emphasis on algorithm development and programming issues for high performance
- No assumed background in computer architecture; knowledge of C is assumed
- Grading: 60% programming assignments (4 x 15%), 40% final exam (July 4)
- Accounts will be provided on the IIT-M system

Course Topics
Architectures
- Single processor core
- Multi-core and SMP (Symmetric Multi-Processor) systems
- GPUs (Graphic Processing Units)
- Short-vector SIMD instruction set architectures
Programming Models/Techniques
- Single processor performance issues
  - Caches and data locality
  - Data dependence and fine-grained parallelism
  - Vectorization (SSE, AVX)
- Parallel programming models
  - Shared-memory parallel programming (OpenMP)
  - GPU programming (CUDA)

Class Meeting Schedule
Lecture from 9am-12pm each day, with a mid-class break. Optional lab session from 12-1pm each day.

4 Programming Assignments
    Topic            Due Date    Weightage
    Data Locality    June 24     15%
    OpenMP           June 27     15%
    CUDA             June 30     15%
    Vectorization    July 2      15%

The Good Old Days for Software (Source: J. Birnbaum)
Single-processor performance experienced dramatic improvements from clock speed and architectural improvements (pipelining, instruction-level parallelism). Applications experienced automatic performance improvement.

Hitting the Power Wall
[Chart: power density (Watts/cm², log scale from 1 to 1000) versus process generation (1.5μ down to 0.07μ) for the i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4, with reference lines for a hot plate, a rocket nozzle, a nuclear reactor, and the Sun's surface.]
Power doubles every 4 years. 5-year projection: 200W total, 125 W/cm²! P=VI: 75W @ 1.5V = 50 A!
Source: "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Fred Pollack, Intel Corp., Micro32 conference keynote, 1999. Courtesy Avi.

Hitting the Power Wall... toward a brighter tomorrow
[Chart of CPU clock frequency over time: http://img.tomshardware.com/us/2005//2/the_mother_of_all_cpu_charts_2005/cpu_frequency.gif]

Hitting the Power Wall
[Same CPU clock frequency chart as above.]
2004: Intel cancels Tejas and Jayhawk due to "heat problems due to the extreme power consumption of the core..."

The Only Option: Use Many Cores
Chip density continues to increase ~2x every 2 years, but clock speed does not; instead, the number of processor cores may double. There is little or no more hidden parallelism (ILP) to be found. Parallelism must be exposed to and managed by software.
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Can You Predict Performance?

    double W, X, Y, Z;
    for (j = 0; j < 100000000; j++) {
        W = 0.999999*X;
        X = 0.999999*W;
    }                                  // Example loop E1

    for (j = 0; j < 100000000; j++) {
        W = 0.999999*W + 0.000001;
        X = 0.999999*X + 0.000001;
        Y = 0.999999*Y + 0.000001;
        Z = 0.999999*Z + 0.000001;
    }                                  // Example loop E2

E1 takes about 0.4s to execute on this laptop. Performance is measured in GFLOPs: billions (Giga) of FLoating point Operations per Second. About how long does E2 take?
1. [0-0.4s]
2. [0.4s-0.6s]
3. [0.6s-0.8s]
4. More than 0.8s
Answer: E2 takes 0.35s => 2.27 GFLOPs.
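A minimal timing harness (a sketch, not part of the slides) for reproducing the E1 vs. E2 comparison on your own machine. The loop bodies and iteration count come from the slide; the surrounding main function, timing calls, and printouts are assumptions, and absolute times will vary with machine and compiler flags.

    /* e1_vs_e2.c - hedged sketch of the E1/E2 timing experiment */
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000L

    int main(void) {
        double W = 1.0, X = 1.0, Y = 1.0, Z = 1.0;
        clock_t t0, t1;

        t0 = clock();
        for (long j = 0; j < ITERS; j++) {   /* E1: W and X form one long dependence chain */
            W = 0.999999*X;
            X = 0.999999*W;
        }
        t1 = clock();
        printf("E1: %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        t0 = clock();
        for (long j = 0; j < ITERS; j++) {   /* E2: four independent recurrences expose ILP */
            W = 0.999999*W + 0.000001;
            X = 0.999999*X + 0.000001;
            Y = 0.999999*Y + 0.000001;
            Z = 0.999999*Z + 0.000001;
        }
        t1 = clock();
        printf("E2: %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        /* use the results so the compiler cannot delete the loops */
        printf("%g %g %g %g\n", W, X, Y, Z);
        return 0;
    }

Compile with something like `gcc -O1 e1_vs_e2.c`; at higher optimization levels the compiler may restructure or vectorize the loops and blur the effect being demonstrated (an assumption worth checking on your compiler).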

ILP Affects Performance
ILP (Instruction Level Parallelism): many operations in a sequential code could be executed concurrently if they do not have dependences.
- Pipelined stages in functional units can be exploited by ILP
- Multiple functional units in a CPU can be exploited by ILP
- ILP is automatically exploited by the system when possible
E2's statements are independent of one another and provide ILP, but E1's statements are not: each assignment to W uses the X just computed, and vice versa, so the operations must execute serially (see loops E1 and E2 above).

Example: Memory Access Cost

    #define N 32
    #define T 1024*1024
    // #define N 4096
    // #define T 64
    double A[N][N];
    for (it = 0; it < T; it++)
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                A[i][j] += 1;

Data movement overheads from main memory to cache can greatly exceed computational costs. About how long will the code run for the 4Kx4K matrix?
1. [0-0.3s]
2. [0.3s-0.6s]
3. [0.6s-1.2s]
4. More than 1.2s

              32x32    4Kx4K
    FLOPs     2^30     2^30
    Time      0.33s    15.5s
    GFLOPs    3.24     0.07
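Why the 4Kx4K case is so slow: the inner loop varies i, so successive accesses A[i][j], A[i+1][j], ... are a full row apart in memory and each touches a different cache line. A sketch of the standard fix, loop interchange (this rewritten nest is illustrative, not from the slides; the measured times above are for the original order):

    /* Interchanged loops: j is now innermost, so A is walked with
     * stride-1 accesses and each cache line is fully used. */
    for (it = 0; it < T; it++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                A[i][j] += 1;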

Cache Memories
Adapted from slides by Prof. Randy Bryant (CMU 15-213) and Dr. Jeffrey Jones (OSU CSE 544)
Topics:
- Generic cache memory organization
- Impact of caches on performance

Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware.
- Hold frequently accessed blocks of main memory
The CPU looks first for data in L1, then in L2, then in main memory.
[Figure: typical bus structure. The CPU chip holds the register file, ALU, and L1 cache; a cache bus connects to the off-chip L2 cache; the bus interface connects over the system bus to the I/O bridge, which connects over the memory bus to main memory.]

Inserting an L1 Cache Between the CPU and Main Memory
The tiny, very fast CPU register file has room for four 4-byte words; the transfer unit between the register file and the cache is a 4-byte block.
The small, fast L1 cache has room for two 4-word blocks (line 0 and line 1); the transfer unit between the cache and main memory is a 4-word block (16 bytes).
The big, slow main memory has room for many 4-word blocks (e.g., block 10: a b c d; block 21: p q r s; block 30: w x y z; ...).

Direct-Mapped Cache
[Figure: a 64-byte memory (byte addresses 0-63 shown in binary), viewed as 8 blocks of 8 bytes each (memory blocks 0-7), mapped round-robin onto a direct-mapped cache with 4 lines (lines 0-3). Each cache line holds a valid bit, a tag, and a data block; each memory block maps to one particular line.]

Direct-Mapped Cache
[Figure: RAM is byte-addressable; a system with 64-bit data paths transfers 8 bytes at a time into a cache line (e.g., the bytes at addresses AA-B1). The appropriate bytes for the requested data type (e.g., AC-AD) are then transferred to a register.]

Accessing Direct-Mapped Caches
Set selection: use the set (= line) index bits to determine the set of interest.
[Figure: the m-bit address is split into t tag bits, s set index bits, and b block offset bits; the set index selects one of sets 0 through S-1, each consisting of a valid bit, a tag, and a cache block.]

Accessing Direct-Mapped Caches
Line matching and word selection:
- Line matching: find a valid line in the selected set with a matching tag
- Word selection: then extract the word
(1) The valid bit must be set.
(2) The tag bits in the cache line must match the tag bits in the address.
(3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte.
[Figure: selected set i holds a block of bytes 0-7; the address's tag field is compared against the line's tag, and the block offset (100₂ = byte 4 in the example) selects the starting byte of the word w0 w1 w2 w3.]

Direct-Mapped Cache Simulation
t=1, s=2, b=1; M=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 entry/set.
Address trace (reads): 0 [0000₂], 1 [0001₂], 13 [1101₂], 8 [1000₂], 0 [0000₂]
(1) 0 [0000₂] miss: set 0 loads block M[0-1]
(2) 1 [0001₂] hit: found in set 0
(3) 13 [1101₂] miss: set 2 loads block M[12-13]
(4) 8 [1000₂] miss: set 0 evicts M[0-1] and loads M[8-9]
(5) 0 [0000₂] miss: set 0 evicts M[8-9] and reloads M[0-1]
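A small sketch (not from the slides) that replays the trace above on the same t=1, s=2, b=1 direct-mapped cache and reports each hit or miss:

    /* dmap_sim.c - direct-mapped cache simulation of the trace above */
    #include <stdio.h>

    int main(void) {
        int valid[4] = {0}, tag[4] = {0};      /* S=4 sets, E=1 line each */
        int trace[] = {0, 1, 13, 8, 0};        /* byte addresses read     */
        int n = sizeof trace / sizeof trace[0];

        for (int k = 0; k < n; k++) {
            int addr = trace[k];
            int set = (addr >> 1) & 0x3;       /* skip b=1 offset bit, keep s=2 set bits */
            int t   = addr >> 3;               /* remaining t=1 tag bit   */
            if (valid[set] && tag[set] == t) {
                printf("%2d: hit  (set %d)\n", addr, set);
            } else {
                printf("%2d: miss (set %d)\n", addr, set);
                valid[set] = 1;                /* install the new block   */
                tag[set] = t;
            }
        }
        return 0;
    }

Running it prints miss, hit, miss, miss, miss, matching steps (1)-(5) above.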

General Structure of Cache Memory
Cache is an array of S = 2^s sets. Each set contains E lines. Each line holds a block of B = 2^b bytes of data, plus one valid bit and t tag bits.
Cache size: C = B x E x S data bytes.
[Figure: sets 0 through S-1, each containing E lines of the form (valid, tag, bytes 0 ... B-1).]

Addressing Caches
Address A is divided into <tag> (t bits), <set index> (s bits), and <block offset> (b bits), from bit m-1 down to bit 0.
The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>. The word contents begin at offset <block offset> bytes from the beginning of the block.
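The address decomposition can be written directly in C. This is an illustrative sketch (the function and its names are assumptions, not from the lecture): it extracts the three fields with shifts and masks for given s and b.

    /* addr_fields.c - split an address into <tag, set index, block offset> */
    #include <stdint.h>
    #include <stdio.h>

    void decompose(uint64_t addr, int s, int b) {
        uint64_t offset = addr & ((1ULL << b) - 1);        /* low b bits       */
        uint64_t set    = (addr >> b) & ((1ULL << s) - 1); /* next s bits      */
        uint64_t tag    = addr >> (s + b);                 /* remaining t bits */
        printf("addr %llu -> tag %llu, set %llu, offset %llu\n",
               (unsigned long long)addr, (unsigned long long)tag,
               (unsigned long long)set, (unsigned long long)offset);
    }

For example, decompose(13, 2, 1) on the direct-mapped cache simulated above (s=2, b=1) reports tag 1, set 2, offset 1.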

Cache Mapping Examples
Address A: <tag> (t bits), <set index> (s bits), <block offset> (b bits). Given the address width m, cache size C, block size B, and lines per set E, how are S, t, s, and b decided?

    Example    m     C      B     E
    1          32    1024   4     1
    2          32    1024   8     4
    3          32    1024   32    32
    4          64    2048   64    16

Cache Mapping Examples (solutions)

    Example    m     C      B     E     S     t     s    b
    1          32    1024   4     1     256   22    8    2
    2          32    1024   8     4     32    24    5    3
    3          32    1024   32    32    1     27    0    5
    4          64    2048   64    16    2     57    1    6
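A sketch that derives the solution columns from the given parameters, using S = C/(B x E), s = log2(S), b = log2(B), and t = m - s - b (the table data is from the slide; the program itself is illustrative):

    /* cache_params.c - derive S, t, s, b for the four examples above */
    #include <stdio.h>

    static int log2i(int x) {               /* integer log2 of a power of two */
        int n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    int main(void) {
        int params[][4] = {                 /* {m, C, B, E} rows of the table */
            {32, 1024,  4,  1},
            {32, 1024,  8,  4},
            {32, 1024, 32, 32},
            {64, 2048, 64, 16},
        };
        for (int i = 0; i < 4; i++) {
            int m = params[i][0], C = params[i][1];
            int B = params[i][2], E = params[i][3];
            int S = C / (B * E);            /* number of sets                 */
            int s = log2i(S), b = log2i(B), t = m - s - b;
            printf("m=%2d C=%4d B=%2d E=%2d -> S=%3d t=%2d s=%d b=%d\n",
                   m, C, B, E, S, t, s, b);
        }
        return 0;
    }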

Cache Performance Metrics
Miss Rate
- Fraction of memory references not found in cache (misses/references)
- Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
- Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
- Typical numbers: 1 clock cycle for L1; 3-8 clock cycles for L2
Miss Penalty
- Additional time required because of a miss
- Typically 25-100 cycles for main memory

Overall Cache Performance
AMAT (Avg. Mem. Access Time) = t_hit + prob_miss * penalty_miss
Three levers for improving it: reduce hit time, reduce miss penalty, reduce miss rate.

Cache Performance: One Level
AMAT = t_hit + prob_miss * penalty_miss
L1$ hits in 1 cycle, with 80% hit rate; main memory hits in 1000 cycles:
AMAT = 1 + (1 - .8)(1000) = 201
L1$ hits in 1 cycle, with 85% hit rate; main memory hits in 1000 cycles:
AMAT = 1 + (1 - .85)(1000) = 151

Cache Performance: Multi-Level
AMAT_i = t_hit_i + prob_miss_i * penalty_miss_i
L1$ hits in 1 cycle, with 50% hit rate; L2$ hits in 10 cycles, with 75% hit rate; L3$ hits in 100 cycles, with 90% hit rate; main memory hits in 1000 cycles.
AMAT = 1 + (1 - .5)(AMAT_2)
     = 1 + (1 - .5)(10 + (1 - .75)(AMAT_3))
     = 1 + (1 - .5)(10 + (1 - .75)(100 + (1 - .9)(AMAT_mem)))
     = 1 + (1 - .5)(10 + (1 - .75)(100 + (1 - .9)(1000)))
     = 31
(OSU CSE 544, J. S. Jones)

Cache Performance: Multi-Level
AMAT_i = t_hit_i + prob_miss_i * penalty_miss_i
L1$ hits in 1 cycle, with 25% hit rate; L2$ hits in 6 cycles, with 80% hit rate; L3$ hits in 60 cycles, with 95% hit rate; main memory hits in 1000 cycles.
AMAT = 1 + (1 - .25)(AMAT_2)
     = 1 + (1 - .25)(6 + (1 - .8)(AMAT_3))
     = 1 + (1 - .25)(6 + (1 - .8)(60 + (1 - .95)(AMAT_mem)))
     = 1 + (1 - .25)(6 + (1 - .8)(60 + (1 - .95)(1000)))
     = 22
(OSU CSE 544, J. S. Jones)
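The nested AMAT expression can be evaluated by folding from the innermost level outward. A small sketch (illustrative, not from the lecture) using the numbers of the second example:

    /* amat.c - evaluate multi-level AMAT from the innermost level out */
    #include <stdio.h>

    int main(void) {
        double t_hit[]    = {1, 6, 60};         /* L1, L2, L3 hit times (cycles) */
        double hit_rate[] = {0.25, 0.80, 0.95}; /* L1, L2, L3 hit rates          */
        double amat = 1000;                     /* main memory access (cycles)   */

        for (int i = 2; i >= 0; i--)            /* fold in L3, then L2, then L1  */
            amat = t_hit[i] + (1 - hit_rate[i]) * amat;

        printf("AMAT = %g cycles\n", amat);     /* prints AMAT = 22 cycles       */
        return 0;
    }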

Layout of C Arrays in Memory
C arrays are allocated in row-major order: each row occupies contiguous memory locations.
Stepping through columns in one row:
    for (i = 0; i < N; i++) sum += a[0][i];
- accesses successive elements
- if block size (B) > 4 bytes, exploits spatial locality: compulsory miss rate = 4 bytes / B
Stepping through rows in one column:
    for (i = 0; i < N; i++) sum += a[i][0];
- accesses distant elements; no spatial locality!
- compulsory miss rate = 1 (i.e. 100%)

Writing Cache Friendly Code
Repeated references to variables are good (temporal locality). Stride-1 reference patterns are good (spatial locality).
Example: summing all elements of a 2D array (cold cache, 4-byte words, 4-word cache blocks):

    int sumarrayrows(int a[N][N]) {
        int i, j, sum = 0;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }
    /* Miss rate = 1/4 = 25% */

    int sumarraycols(int a[N][N]) {
        int i, j, sum = 0;
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }
    /* Miss rate = 100% */

Matrix Multiplication Example
Major cache effects to consider:
- Total cache size: exploit temporal locality and keep the working set small (e.g., by using blocking; see the sketch after the summary slide below)
- Block size: exploit spatial locality
Description:
- Multiply N x N matrices; O(N^3) total operations
- Accesses: N reads per source element; N values summed per destination (but may be able to hold in register)

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;                     /* variable sum held in register */
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Miss Rate Analysis for Matrix Multiply
Assume:
- Line size = 32B (big enough for 4 64-bit words)
- Matrix dimension (N) is very large: approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows
Analysis method: look at the access pattern of the inner loop over each of A (indexed by i,k), B (indexed by k,j), and C (indexed by i,j).

Matrix Multiplication (ijk)

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Inner loop: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed.
Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0

Matrix Multiplication (jik)

    /* jik */
    for (j = 0; j < n; j++) {
        for (i = 0; i < n; i++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Inner loop: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed.
Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0

Matrix Multiplication (kij)

    /* kij */
    for (k = 0; k < n; k++) {
        for (i = 0; i < n; i++) {
            r = a[i][k];
            for (j = 0; j < n; j++)
                c[i][j] += r * b[k][j];
        }
    }

Inner loop: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise.
Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25

Matrix Multiplication (ikj)

    /* ikj */
    for (i = 0; i < n; i++) {
        for (k = 0; k < n; k++) {
            r = a[i][k];
            for (j = 0; j < n; j++)
                c[i][j] += r * b[k][j];
        }
    }

Inner loop: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise.
Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25

Matrix Multiplication (jki)

    /* jki */
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            r = b[k][j];
            for (i = 0; i < n; i++)
                c[i][j] += a[i][k] * r;
        }
    }

Inner loop: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise.
Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0

Matrix Multiplication (kji)

    /* kji */
    for (k = 0; k < n; k++) {
        for (j = 0; j < n; j++) {
            r = b[k][j];
            for (i = 0; i < n; i++)
                c[i][j] += a[i][k] * r;
        }
    }

Inner loop: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise.
Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0

Summary of Matrix Multiplication
- ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25
- kij (& ikj): 2 loads, 1 store; misses/iter = 0.5
- jki (& kji): 2 loads, 1 store; misses/iter = 2.0
(Loop nests as shown on the ijk, kij, and jki slides above.)
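Blocking (tiling), mentioned earlier as the way to exploit total cache size, restructures the loops so that small tiles of A, B, and C are reused from cache before being evicted. A sketch under stated assumptions: BLOCK is a hypothetical tuning parameter chosen so three BLOCK x BLOCK tiles fit in cache, and the signature uses C99 variable-length array parameters; this is illustrative, not the course's reference implementation.

    /* matmul_blocked.c - illustrative blocked (tiled) matrix multiply */
    #define BLOCK 32   /* tile edge; tune so 3 tiles fit in cache */

    void matmul_blocked(int n, double a[n][n], double b[n][n], double c[n][n]) {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                for (int kk = 0; kk < n; kk += BLOCK)
                    /* multiply one tile of A by one tile of B into C's tile */
                    for (int i = ii; i < ii + BLOCK && i < n; i++)
                        for (int j = jj; j < jj + BLOCK && j < n; j++) {
                            double sum = c[i][j];
                            for (int k = kk; k < kk + BLOCK && k < n; k++)
                                sum += a[i][k] * b[k][j];
                            c[i][j] = sum;
                        }
    }

With tiles resident in cache, each loaded element of A and B is reused BLOCK times rather than once, cutting misses per iteration well below the best unblocked loop order.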