CSE 160 Lecture 10: Instruction Level Parallelism (ILP) and Vectorization


Announcements: Quiz on Friday. Sign up for the Friday lab sessions in APM.

Particle simulation. Two routines contain the work to be parallelized with OpenMP #pragmas; anything not inside an OMP parallel for should run serially. Where is synchronization needed?

    void SimulateParticles() {
        for (int step = 0; step < nsteps; step++) {
            apply_forces(particles, n);
            if (imbal)
                imbal_particles(particles, n);
            move_particles(particles, n);
            VelNorms(particles, n, uMax, vMax, uL2, vL2);
        }
    }

Code for computing the force: OpenMP parallelization of the outermost loop.

    void apply_forces(particle_t* particles, int n) {
        #pragma omp parallel for shared(particles, n)
        for (int i = 0; i < n; i++) {
            particles[i].ax = particles[i].ay = 0;
            if ((particles[i].vx < 0) || (particles[i].vx > 0))   // for imbalanced case
                for (int j = 0; j < n; j++) {
                    if (i == j) continue;
                    particles[i].{ax,ay} += coef * {dx,dy};
                }
        }
    }
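
The {ax,ay} += coef * {dx,dy} line is slide shorthand for updating both acceleration components from one pairwise interaction. A minimal sketch of how that body might look is below; the struct layout and the force law (a simple inverse-square coefficient G) are illustrative placeholders, not the assignment's actual model.

    // Sketch only: a hypothetical expansion of the slide's
    // "{ax,ay} += coef * {dx,dy}" shorthand. The constant G and the
    // inverse-square coefficient are placeholders.
    struct particle_t { double x, y, vx, vy, ax, ay; };

    void add_pair_force(particle_t& pi, const particle_t& pj) {
        const double G = 1.0;                  // placeholder constant
        double dx = pj.x - pi.x;
        double dy = pj.y - pi.y;
        double r2 = dx*dx + dy*dy + 1e-12;     // avoid division by zero
        double coef = G / r2;                  // pairwise coefficient
        pi.ax += coef * dx;                    // update both components,
        pi.ay += coef * dy;                    // as the shorthand implies
    }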

Today's lecture: pipelining; instruction level parallelism (ILP); vectorization, SSE, and vector instructions.

What makes a processor run faster? Registers and cache, pipelining, instruction level parallelism, and vectorization (SSE).

Pipelining: assembly-line processing, as in an auto plant. Dave Patterson's laundry example: 4 people doing laundry, where each load takes wash (30 min) + dry (40 min) + fold (20 min) = 90 min. Sequential execution takes 4 * 90 min = 6 hours; pipelined execution takes 30 + 4*40 + 20 = 210 min = 3.5 hours. Bandwidth is measured in loads/hour. Pipelining helps bandwidth but not latency (each load still takes 90 min), and bandwidth is limited by the slowest pipeline stage. Potential speedup = number of pipe stages. [Figure: timeline of loads A-D overlapping from 6 PM onward.]
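
As a sanity check on those numbers, here is the slide's arithmetic in a slightly more general form (a sketch, with t_wash = 30, t_dry = 40, t_fold = 20, and n = 4 loads):

    T_seq  = n * (t_wash + t_dry + t_fold) = 4 * 90 = 360 min
    T_pipe = (t_wash + t_dry + t_fold) + (n - 1) * max(t_wash, t_dry, t_fold)
           = 90 + 3 * 40 = 210 min = 3.5 hours

T_pipe charges the first load the full 90 minutes and each later load only the slowest stage; the slide's 30 + 4*40 + 20 grouping counts the same 210 minutes stage by stage.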

Instruction level parallelism: execute more than one instruction at a time, e.g. x = y / z; and a = b + c; are independent and can execute together. Dynamic techniques allow stalled instructions to proceed and reorder instructions. For example, in

    x = y / z;
    a = x + c;
    t = c - q;

the add must wait for the slow divide, but reordered as

    x = y / z;
    t = c - q;
    a = x + c;

the independent subtraction fills part of the divide's latency.

MIPS R4000 multi-execution pipeline. [Figure: after IF and ID, instructions go to the integer unit (EX), the FP/int multiplier (stages m1-m7), the FP adder (stages a1-a4), or the FP/int divider (latency = 25), then to MEM and WB.] Example instruction streams:

    x = y / z;  a = b + c;  s = q * r;  i++;

    x = y / z;  t = c - q;  a = x + c;

(David Culler)

Scoreboarding and Tomasulo's algorithm: hardware data structures keep track of when instructions complete, which instructions depend on their results, and when it is safe to write a register. This deals with data hazards: WAR (write after read), RAW (read after write), and WAW (write after write). [Figure: instruction fetch feeds an issue & resolve stage; the scoreboard records op, operand fetch, and execute status for each functional unit (FU), illustrated on the sequence a = b*c; x = y - b; q = a / y; y = x + b.]

What makes a processor run faster? Registers and cache, pipelining, instruction level parallelism, and vectorization (SSE).

Control mechanism: Flynn's classification (1966) asks how the processors issue instructions. SIMD (Single Instruction, Multiple Data): processing elements execute a global instruction stream in lock-step. MIMD (Multiple Instruction, Multiple Data): clusters and servers, in which processors execute instruction streams independently. [Figure: the SIMD machine has a single control unit driving many PEs over an interconnect; the MIMD machine attaches a control unit to every PE (PE + CU) on the interconnect.]

SIMD (Single Instruction, Multiple Data): operate on regular arrays of data, e.g. forall i = 0:N-1, p[i] = a[i] * b[i]. Two landmark SIMD designs were the ILLIAC IV (1960s) and the Connection Machine 1 and 2 (1980s); the Cray-1 (1976) was a vector computer. Intel and others support SIMD for multimedia and graphics: SSE (Streaming SIMD Extensions) and Altivec define operations on vectors, and GPUs and the Cell Broadband Engine (Sony PlayStation) follow the same model. Performance is reduced on data-dependent or irregular computations such as

    forall i = 0 : n-1
        x[i] = y[i] + z[ K[i] ]       // indirect (gather) access
    end forall

    forall i = 0 : n-1
        if (x[i] < 0) then
            y[i] = -x[i]
        else
            y[i] = x[i]
        end if                        // data-dependent branch
    end forall

[Figure: element-wise product of two 4-element vectors.]

Streaming SIMD Extensions: SSE (SSE4 on Intel Nehalem) and Altivec operate on short vectors of 128 bits (AVX: 256 bits), i.e. 4 floats, 2 doubles, or 16 bytes at a time (en.wikipedia.org/wiki/Streaming_SIMD_Extensions). The loop

    for i = 0 : N-1 { p[i] = a[i] * b[i]; }

is executed a vector-width chunk at a time, with each chunk of multiplications done by a single instruction. [Figure: element-wise product of two 4-element vectors; Jim Demmel.]
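
To make the idea concrete, here is a minimal sketch of the same element-wise product written with SSE intrinsics; it assumes n is a multiple of 4 and that a, b, and p are plain float buffers (the names and the helper function are illustrative, not from the slides).

    #include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_* single-precision ops

    // Multiply two float arrays element-wise, 4 floats per SSE instruction.
    // Assumes n is a multiple of 4; unaligned loads/stores are used for safety.
    void multiply_sse(const float* a, const float* b, float* p, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   // load 4 floats from a
            __m128 vb = _mm_loadu_ps(&b[i]);   // load 4 floats from b
            __m128 vp = _mm_mul_ps(va, vb);    // 4 multiplies in one instruction
            _mm_storeu_ps(&p[i], vp);          // store 4 results to p
        }
    }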

How do we use the SSE instructions? Low level: assembly language or libraries. Higher level: a vectorizing compiler. On Bang (SSE3), g++ -O3 vectorizes automatically:

    g++ --std=c++11 -O3 -ftree-vectorizer-verbose=2

    float b[n], c[n];
    for (int i = 0; i < n; i++)
        b[i] += b[i]*b[i] + c[i]*c[i];

    7: LOOP VECTORIZED.
    vec.cpp:6: note: vectorized 1 loops in function.

Performance, single precision: 1.9 sec with vectorization, 3.2 sec without. Double precision: 3.6 sec with vectorization, 3.3 sec without. See http://gcc.gnu.org/projects/tree-ssa/vectorization.html
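
A self-contained version of that benchmark loop might look like the sketch below; the array size, repetition count, and timing harness are made up purely so the file compiles and runs, and the compile line mirrors the flags quoted above.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Sketch of the slide's vectorization benchmark. Sizes and repetition
    // counts are illustrative, not the values behind the quoted timings.
    int main() {
        const int n = 1 << 20;
        std::vector<float> b(n, 0.0f), c(n, 1e-3f);

        auto start = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 100; rep++)
            for (int i = 0; i < n; i++)          // the loop g++ -O3 vectorizes
                b[i] += b[i]*b[i] + c[i]*c[i];
        auto stop = std::chrono::steady_clock::now();

        std::printf("b[0] = %f, time = %f s\n", b[0],
                    std::chrono::duration<double>(stop - start).count());
        return 0;
    }

    // Build (for example): g++ --std=c++11 -O3 -ftree-vectorizer-verbose=2 vec.cpp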

How does the vectorizer work? The loop is transformed to operate on vector-width chunks:

    for (i = 0; i < 1024; i += 4)
        a[i:i+3] = b[i:i+3] + c[i:i+3];

which maps onto vector instructions:

    for (i = 0; i < 1024; i += 4) {
        vb = vec_ld( &b[i] );
        vc = vec_ld( &c[i] );
        va = vec_add( vb, vc );
        vec_st( va, &a[i] );
    }

What prevents vectorization? Data dependencies: in

    for (int i = 1; i < N; i++)
        b[i] = b[i-1] + 2;

each iteration reads the value the previous one wrote (b[1] = b[0] + 2; b[2] = b[1] + 2; b[3] = b[2] + 2; ...), and the compiler reports: not vectorized, possible dependence between datarefs b[d.30220_12] and b[i_42]. Also, only inner loops are vectorized:

    for (int j = 0; j < reps; j++)
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];

Here only the i loop is a candidate.
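
When a recurrence like this one happens to have a closed form, one way to recover vectorization is to rewrite it so each iteration is independent. A minimal sketch under that assumption (this works only for this specific loop, not as a general recipe for loop-carried dependences):

    // The serial recurrence b[i] = b[i-1] + 2 has the closed form
    // b[i] = b[0] + 2*i, so every iteration can be computed independently
    // and the loop becomes vectorizable.
    void fill_closed_form(float* b, int N) {
        float b0 = b[0];
        for (int i = 1; i < N; i++)
            b[i] = b0 + 2.0f * i;     // no dependence on b[i-1]
    }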

What prevents vectorization (continued): interrupted flow out of the loop. This loop has multiple exits, so the compiler reports "Loop not vectorized/parallelized: multiple exits":

    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
        maxval = (a[i] > maxval ? a[i] : maxval);
        if (maxval > 1000.0) break;
    }

Without the early exit, the loop will vectorize:

    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
        maxval = (a[i] > maxval ? a[i] : maxval);
    }

In-class exercises

Questions: 1. Iteration-to-thread mapping. 2. Removing data dependencies. 3. Dependence analysis. 4. Time-constrained scaling. 5. Tree summation. 6. Performance.

2. Removing data dependencies. B initially: 0 1 2 3 4 5 6 7. B after running on 1 thread: 7 7 7 7 11 12 13 14. How can we split the loop into 2 loops so that each loop parallelizes and the result is correct?

    #pragma omp for shared(N, B)
    for i = 0 to N-1
        B[i] += B[N-1-i];

On one thread the updates are B[0] += B[7], B[1] += B[6], B[2] += B[5], B[3] += B[4], then B[4] += B[3], B[5] += B[2], B[6] += B[1], B[7] += B[0], where the second half reads values already updated by the first half.

Splitting a loop. For iterations i = N/2 to N-1, B[N-1-i] references newly computed data; all other iterations reference old data. B initially: 0 1 2 3 4 5 6 7; correct result: 7 7 7 7 11 12 13 14.

Original loop:

    for i = 0 to N-1
        B[i] += B[N-1-i];

Split so that each half parallelizes:

    for i = 0 to N/2-1
        B[i] += B[N-1-i];
    for i = N/2 to N-1
        B[i] += B[N-1-i];

The code is in $PUB/Examples/OpenMP/Assign; compile with omp=1 on the make line.
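
With the loop split this way, each half can carry its own OpenMP worksharing directive. A minimal sketch (the function name, array type, and pragma placement are the obvious choices here, not taken verbatim from the course handout):

    #include <omp.h>

    // Within each loop no iteration reads an element that another iteration
    // of the same loop writes, so each loop parallelizes; the implicit
    // barrier at the end of the first "omp for" guarantees the second half
    // reads fully updated values from the first half.
    void update(int* B, int N) {
        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 0; i < N / 2; i++)       // reads old upper half
                B[i] += B[N - 1 - i];

            #pragma omp for
            for (int i = N / 2; i < N; i++)       // reads new lower half
                B[i] += B[N - 1 - i];
        }
    }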

3. Loop dependence analysis. Which loop(s) can we correctly parallelize with OpenMP?

    1. for i = 1 to N-1
           A[i] = A[i] + B[i-1];

    2. for i = 0 to N-2
           A[i+1] = A[i] + 1;

    3. for i = 0 to N-1 step 2
           A[i] = A[i-1] + A[i];

    4. for i = 0 to N-2 {
           A[i] = B[i];
           C[i] = A[i] + B[i];
           E[i] = C[i+1];
       }

4. Time-constrained scaling. Sum N numbers on P processors, where N >> P. Determine the largest problem that can be solved in time T = 10^4 time units on 512 processors. Let the time to perform one addition be 1 time unit, and let β be the time to add a value inside a critical section.

Performance model. Local additions take N/P - 1 time units and the reduction takes β (lg P), so since N >> P, T(N,P) ~ N/P + β (lg P). Determine the largest problem that can be solved in T = 10^4 time units on P = 512 processors with β = 1000 time units. The constraint is T(N, 512) <= 10^4:

    N/512 + 1000 * (lg 512) = N/512 + 1000 * 9 <= 10^4
    N/512 <= 1000
    N <= 512,000, i.e. N ~ 5 * 10^5 (approximately)

5. Tree summation. Input: an array x[] of length N >> P. Output: the sum of the elements of x[]. Goal: compute the sum in lg P time. Assume P is a power of 2 and let K = lg P. Starter code:

    for m = 0 to K-1 {
        ...
    }

[Figure: a summation tree over the values 3 1 7 0 4 1 6 3, with partial sums 4 7 5 9, then 11 14, then 25.]
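
One way the starter loop might be filled in, as a sketch only: assume each thread has already reduced its slice of x[] into partial[t], and let an OpenMP barrier stand in for whatever synchronization the exercise asks you to supply between levels.

    #include <omp.h>

    // Sketch of a tree summation over P partial sums, P a power of 2.
    // After K = lg P rounds, partial[0] holds the total, matching the
    // picture on the next slide (threads 0,2,4,6 combine, then 0 and 4,
    // then 0).
    double tree_sum(double* partial, int P) {
        int K = 0;
        while ((1 << K) < P) K++;                 // K = lg P

        for (int m = 0; m < K; m++) {
            int stride = 1 << m;                  // 1, 2, 4, ...
            #pragma omp parallel for
            for (int t = 0; t < P; t += 2 * stride)
                partial[t] += partial[t + stride];   // pairwise combine
            // implicit barrier at the end of the parallel for
        }
        return partial[0];
    }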

Visualizing the summation. [Figure: values in positions 0-7; in the first step threads 0, 2, 4, and 6 form the pairwise sums 0+1, 2+3, 4+5, and 6+7; in the next step threads 0 and 4 form 0..3 and 4..7; finally thread 0 forms 0..7.]

6. Performance. You observe the following running times for a parallel program running a fixed workload N; assume the only losses are due to serial sections.

    NT (threads)   Time
    1              10000
    2               6000
    8               3000

What is the speedup and efficiency on 8 processors? What will the running time be on 4 processors? What is the maximum possible speedup on an infinite number of processors? What fraction of the total running time on 1 processor corresponds to the serial section? What fraction of the total running time on 2 processors corresponds to the serial section?