CS 152, Spring 2011 Section 10

1 CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley

2 Agenda Stuff (Quiz 4 Prep)

3 Intel Core 2 Duo (Penryn) vs. NVidia GTX 280

Intel Core 2 Duo (Penryn)
    dual-core
    [value missing] nm
    410 million transistors
    ~2 GHz
    3 or 6 MB of cache
    [value missing] Watts
    107 mm^2
    each core is 22 mm^2
    L2 SRAM is 6 mm^2/MB

NVidia GTX 280
    10 core(?) (240 stream processors)
    [value missing] nm
    1.4 billion transistors
    576 mm^2
    [value missing] MHz (core clock)
    236 Watts!!!

4 Quiz 4

VLIW (for real this time)
    able to write assembly for VLIW
    software instruction reordering
    loop unrolling
    software pipelining
    how code will get scheduled on different pipelines
conditional execution (for VLIW, vector, and GPU)
types of parallelism (ILP, TLP, DLP)
Vector processors
    able to write vector assembly (including how to strip-mine loops!)
    chaining
Multithreading
    fine-grain, coarse-grain, SMT
GPUs/SIMT model
    how do they handle conditional execution/branches? (spoiler alert: branch divergence)
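Since strip-mining is on the topic list, here is a minimal C sketch of the idea, using the B[i] = A[i] + C loop from the later slides. The maximum vector length `MVL = 64` and the helper `vector_add_scalar` are assumptions made for illustration; the helper stands in for one sequence of vector instructions operating on `vl` elements.

```c
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length of the hypothetical machine */

/* Stands in for one vector load / vector add / vector store sequence
 * executed with the vector length register set to vl (vl <= MVL). */
static void vector_add_scalar(const double *a, double *b, size_t vl, double c) {
    for (size_t k = 0; k < vl; k++)
        b[k] = a[k] + c;
}

/* Strip-mined B[i] = A[i] + C: process n elements in strips of at most
 * MVL elements. Returns the number of strips executed. */
size_t add_scalar_stripmined(const double *A, double *B, size_t n, double C) {
    size_t strips = 0;
    size_t i = 0;
    while (i < n) {
        /* Set the vector length: a full strip, or whatever is left over. */
        size_t vl = (n - i < MVL) ? (n - i) : MVL;
        vector_add_scalar(A + i, B + i, vl, C);
        i += vl;
        strips++;
    }
    return strips;
}
```

With n = 150 and the assumed MVL of 64, the loop runs three strips of 64, 64, and 22 elements.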

5 VLIW: Very Long Instruction Word

Instruction format: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
    Two Integer Units, Single Cycle Latency
    Two Load/Store Units, Three Cycle Latency
    Two Floating-Point Units, Four Cycle Latency

Multiple operations packed into one instruction
Each operation slot is for a fixed function
Constant operation latencies are specified
Architecture requires guarantee of:
    Parallelism within an instruction => no cross-operation RAW check
    No data use before data ready => no data interlocks

Note: Iron Law questions about CPI are about counting the instructions, not the individual ops

March 14, 2011 CS152, Spring
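To make the note about counting instructions rather than ops concrete, here is a small C sketch. The `VliwInstr` type and both helper functions are hypothetical names introduced for illustration; the only facts taken from the slide are the six fixed-function slots and the rule that CPI is measured per instruction.

```c
#include <stddef.h>

enum { VLIW_SLOTS = 6 };  /* Int1, Int2, Mem1, Mem2, FP1, FP2 */

/* One VLIW instruction; a NULL slot is an empty (no-op) slot. */
typedef struct {
    const char *slot[VLIW_SLOTS];
} VliwInstr;

/* Count the individual operations packed into a program. */
size_t count_ops(const VliwInstr *prog, size_t n_instrs) {
    size_t ops = 0;
    for (size_t i = 0; i < n_instrs; i++)
        for (size_t s = 0; s < VLIW_SLOTS; s++)
            if (prog[i].slot[s] != NULL)
                ops++;
    return ops;
}

/* CPI in the Iron Law sense: cycles per instruction, not per op. */
double cpi(size_t cycles, size_t n_instrs) {
    return (double)cycles / (double)n_instrs;
}
```

A two-instruction program holding five ops that runs in two cycles has a CPI of 1.0, even though it completes 2.5 ops per cycle.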

6 Loop Unrolling

    for (i=0; i<n; i++)
        B[i] = A[i] + C;

Unroll inner loop to perform 4 iterations at once:

    for (i=0; i<n; i+=4) {
        B[i]   = A[i]   + C;
        B[i+1] = A[i+1] + C;
        B[i+2] = A[i+2] + C;
        B[i+3] = A[i+3] + C;
    }

Need to handle values of n that are not multiples of the unrolling factor with a final cleanup loop
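The cleanup loop mentioned on this slide can be sketched in C as follows. The function names and the use of `double` arrays are assumptions for illustration; the slide does not specify element types.

```c
#include <stddef.h>

/* Reference version: B[i] = A[i] + C for every i in [0, n). */
void add_scalar(const double *A, double *B, size_t n, double C) {
    for (size_t i = 0; i < n; i++)
        B[i] = A[i] + C;
}

/* Same computation unrolled by 4, followed by a cleanup loop. */
void add_scalar_unrolled4(const double *A, double *B, size_t n, double C) {
    size_t i = 0;

    /* Main loop: runs only while at least 4 elements remain. */
    for (; i + 4 <= n; i += 4) {
        B[i]     = A[i]     + C;
        B[i + 1] = A[i + 1] + C;
        B[i + 2] = A[i + 2] + C;
        B[i + 3] = A[i + 3] + C;
    }

    /* Cleanup loop: handles the last n % 4 elements, if any. */
    for (; i < n; i++)
        B[i] = A[i] + C;
}
```

For n = 7 the main loop runs once (elements 0 to 3) and the cleanup loop handles elements 4 to 6.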

7 Software Pipelining

8 Loop Execution

    for (i=0; i<n; i++)
        B[i] = A[i] + C;

Compile:

    loop: ld   f1, 0(r1)
          add  r1, 8
          fadd f2, f0, f1
          sd   f2, 0(r2)
          add  r2, 8
          bne  r1, r3, loop

Schedule:

    Cycle  Int1    Int2  M1  M2  FP+   FPx
    1      add r1        ld
    2
    3
    4                            fadd
    5
    6
    7
    8      add r2  bne   sd

How many FP ops/cycle? 1 fadd / 8 cycles = 0.125
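The 8-cycle figure follows from the dependence chain ld -> fadd -> sd and the unit latencies given on the VLIW slide (loads take three cycles, FP operations four). A minimal C sketch of that arithmetic is below; the `LoopSchedule` type and function names are hypothetical, and the sketch assumes a result issued in cycle c with latency L can first be consumed in cycle c + L.

```c
/* Issue cycles for the three dependent operations of one iteration. */
typedef struct {
    int ld_cycle;
    int fadd_cycle;
    int sd_cycle;
} LoopSchedule;

/* Schedule one iteration with no overlap between iterations.
 * Assumption: an op issued at cycle c with latency L produces a value
 * that a dependent op can first use at cycle c + L. */
LoopSchedule schedule_one_iteration(int load_latency, int fp_latency) {
    LoopSchedule s;
    s.ld_cycle   = 1;                           /* load issues first      */
    s.fadd_cycle = s.ld_cycle + load_latency;   /* waits for loaded value */
    s.sd_cycle   = s.fadd_cycle + fp_latency;   /* waits for the sum      */
    return s;
}

/* One fadd per iteration; the iteration ends on the store's cycle. */
double fp_ops_per_cycle(LoopSchedule s) {
    return 1.0 / (double)s.sd_cycle;
}
```

Calling `schedule_one_iteration(3, 4)` places the load in cycle 1, the fadd in cycle 4, and the store in cycle 8, which gives 1/8 = 0.125 FP ops per cycle.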

9 Software Pipelining

    for (i=0; i<n; i++)
        B[i] = A[i] + C;

Compile:

    loop: ld   f1, 0(r1)
          add  r1, 8
          fadd f2, f0, f1
          sd   f2, 0(r2)
          add  r2, 8
          bne  r1, r3, loop

Schedule? How does one do software pipelining?

Let's run through an example that does software pipelining WITHOUT loop unrolling.

10 Software Pipelining

    for (i=0; i<n; i++)
        B[i] = A[i] + C;

Compile:

    loop: ld   f1, 0(r1)
          add  r1, 8
          fadd f2, f0, f1
          sd   f2, 0(r2)
          add  r2, 8
          bne  r1, r3, loop

Schedule (columns: Int1, Int2, M1, M2, FP+, FPx), showing a single iteration (cycle-by-cycle placement not recoverable from the slide text):

    prolog:          add r1, ld f1, fadd f2
    loop (iterate):  add r2, bne, sd f2
    epilog:

11 Software Pipelining

    for (i=0; i<n; i++)
        B[i] = A[i] + C;

Compile:

    loop: ld   f1, 0(r1)
          add  r1, 8
          fadd f2, f0, f1
          sd   f2, 0(r2)
          add  r2, 8
          bne  r1, r3, loop

Schedule (columns: Int1, Int2, M1, M2, FP+, FPx), with successive iterations overlapped (cycle-by-cycle placement not recoverable from the slide text):

    prolog:          add r1, ld f1, ld f1, fadd f2
    loop (iterate):  add r1, add r2, add r1, bne, ld f1, sd f2, sd f2, fadd f2, fadd f2
    epilog:          add r2, bne, sd f2

How many FLOPS/cycle? 1 fadd / 4 cycles = 0.25
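The prolog / loop / epilog structure can also be seen at the source level. The C sketch below is an illustration under stated assumptions, not the slide's exact schedule: it splits each iteration into three stages (load, add, store) and overlaps the store of iteration i-2, the add of iteration i-1, and the load of iteration i, without unrolling. The function name and the temporaries `loaded` and `summed` are hypothetical.

```c
#include <stddef.h>

/* Software-pipelined B[i] = A[i] + C with three stages per iteration. */
void add_scalar_swp(const double *A, double *B, size_t n, double C) {
    if (n < 2) {
        /* Too short to fill the pipeline: fall back to the plain loop. */
        for (size_t i = 0; i < n; i++)
            B[i] = A[i] + C;
        return;
    }

    double loaded, summed;

    /* Prolog: start iterations 0 and 1 to fill the pipeline. */
    loaded = A[0];          /* load, iteration 0 */
    summed = loaded + C;    /* add,  iteration 0 */
    loaded = A[1];          /* load, iteration 1 */

    /* Steady-state loop: one stage from each of three iterations.
     * On a VLIW these three operations could share one instruction. */
    for (size_t i = 2; i < n; i++) {
        B[i - 2] = summed;      /* store, iteration i-2 */
        summed = loaded + C;    /* add,   iteration i-1 */
        loaded = A[i];          /* load,  iteration i   */
    }

    /* Epilog: drain the two iterations still in flight. */
    B[n - 2] = summed;          /* store, iteration n-2 */
    summed = loaded + C;        /* add,   iteration n-1 */
    B[n - 1] = summed;          /* store, iteration n-1 */
}
```

Within the steady-state loop the statements are ordered store, add, load so that each temporary is read before it is overwritten; a VLIW compiler would instead issue them in parallel and rely on the specified latencies.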

12 Pset 4, Question 4 (Vector Processors)

13 Pset 4, Question 4 (Vector Processors)

14 Handout Problem 2: Vector

Vector machines often have a lot of memory bandwidth (the SX-9 has 256 GB/s!). Why do they need it, and why do current superscalars not provide as much?

15 Questions?
