61C Survey. Agenda. Reference Problem. Application Example: Deep Learning 10/25/17. CS 61C: Great Ideas in Computer Architecture


CS 61C: Great Ideas in Computer Architecture
Lecture 18: Parallel Processing, SIMD
Krste Asanović & Randy Katz

61C Survey
"It would be nice to have a review lecture every once in a while, actually showing us how things fit in the bigger picture."

Topics so far
What we learned:
1. Binary numbers
2. C
3. Pointers
4. Assembly language
5. Datapath architecture
6. Pipelining
7. Caches
8. Performance evaluation
9. Floating point
What does this buy us? Promise: execution speed. Let's check!

Reference Problem
Matrix multiplication
  Basic operation in many engineering, data, and image processing tasks
  Image filtering, noise reduction, ...
  Many closely related operations, e.g., stereo vision (project 4)
dgemm: double-precision floating-point matrix multiplication

Application Example: Deep Learning
Image classification (cats ...)
Pick best vacation photos
Machine translation
Clean up accent
Fingerprint verification
Automatic game playing

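Matrices
Square matrix of dimension N x N; row index i and column index j each run from 0 to N-1.

Matrix Multiplication
C = A*B
$C_{ij} = \sum_k A_{ik} B_{kj}$
[Figure: row i of A and column j of B are combined over index k to produce element C_ij]

Reference: Python
Matrix multiplication in Python: c = a * b, where a, b, c are N x N matrices
[Table: N vs. Python performance in MFLOPS; values not preserved]
1 MFLOPS = 1 million floating-point operations per second (fadd, fmul)
dgemm(N) takes 2*N^3 flops

C
[Slide content not preserved]

Timing Program Execution
[Slide content not preserved]

C versus Python
[Table: N, C [GFLOPS], Python [GFLOPS], and the resulting speedup factor; values not preserved]
Which class gives you this kind of power?
We could stop here ... but why? Let's do better!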
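The definition of C_ij above maps directly onto three nested loops. A minimal sketch in C, assuming row-major storage in flat arrays (the lecture's own dgemm code is not preserved here and may use a different layout); the names `dgemm_naive` and `dgemm_flops` are illustrative only:

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B for square n x n matrices stored row-major in flat arrays
   (the storage layout is an assumption of this sketch).
   Performs n^3 multiplies and n^3 adds, i.e. 2*n^3 flops. */
void dgemm_naive(size_t n, const double *a, const double *b, double *c) {
    assert(n > 0 && a && b && c);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;                 /* accumulates C_ij */
            for (size_t k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

/* Operation count behind the MFLOPS/GFLOPS figures: 2 * n^3. */
double dgemm_flops(size_t n) {
    return 2.0 * (double)n * (double)n * (double)n;
}
```

Dividing `dgemm_flops(n)` by the measured run time in seconds gives FLOPS; divide by 10^6 or 10^9 for MFLOPS or GFLOPS.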


Why Parallel Processing?
CPU clock rates are no longer increasing
  Technical & economic challenges
  Advanced cooling technology too expensive or impractical for most applications
  Energy costs are prohibitive
Parallel processing is the only path to higher speed
Compare airlines:
  Maximum speed limited by speed of sound and economics
  Use more and larger airplanes to increase throughput
  (And smaller seats ...)

Using Parallelism for Performance
Two basic ways:
  Multiprogramming: run multiple independent programs in parallel. "Easy"
  Parallel computing: run one program faster. "Hard"
We'll focus on parallel computing in the next few lectures.

New-School Machine Structures (It's a bit more complicated!)
Software:
  Parallel Requests: assigned to computer, e.g., search "Katz"
  Parallel Threads: assigned to core, e.g., lookup, ads
  Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
  Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words
  Hardware descriptions: all gates @ one time
  Programming Languages
Harness parallelism & achieve high performance
Hardware (today's lecture is marked in the figure):
  Warehouse Scale Computer, Smart Phone
  Computer: Core(s), Memory (Cache), Input/Output
  Core: Instruction Unit(s), Functional Unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3
  Cache Memory
  Logic Gates

Single-Instruction/Single-Data Stream (SISD)
Sequential computer that exploits no parallelism in either the instruction or data streams.
Examples of SISD architecture are traditional uniprocessor machines, e.g., our trusted RISC-V pipeline.
This is what we did up to now in 61C.

Single-Instruction/Multiple-Data Stream (SIMD or "sim-dee")
A SIMD computer applies a single instruction stream to multiple data streams, for operations that may be naturally parallelized, e.g., Intel SIMD instruction extensions or an NVIDIA Graphics Processing Unit (GPU).
Today's topic.
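The "add of 4 pairs of words" (A0+B0 ... A3+B3) can be written so that a single source-level operation covers all four lanes. A rough sketch, assuming GCC or Clang: it uses their `vector_size` vector extension, not the x86 intrinsics introduced later in the lecture, and the helper name `add4` is made up for illustration:

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang vector extension: four doubles treated as one value.
   This is a compiler feature, not the x86 intrinsics API. */
typedef double v4d __attribute__((vector_size(4 * sizeof(double))));

/* out[i] = a[i] + b[i] for i = 0..3, expressed as one vector add. */
void add4(const double *a, const double *b, double *out) {
    assert(a && b && out);
    v4d va, vb, vc;
    memcpy(&va, a, sizeof va);    /* load four lanes */
    memcpy(&vb, b, sizeof vb);
    vc = va + vb;                 /* one operation, four additions */
    memcpy(out, &vc, sizeof vc);  /* store four lanes */
}
```

Whether the add becomes a single SIMD instruction depends on the target and compiler flags; on hardware without 256-bit vectors the compiler splits it into narrower operations.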

Multiple-Instruction/Multiple-Data Streams (MIMD or "mim-dee")
[Figure: an instruction pool and a data pool feeding four processing units (PU)]
Multiple autonomous processors simultaneously executing different instructions on different data.
MIMD architectures include multicore and Warehouse-Scale Computers.
Topic of Lecture 19 and beyond.

Multiple-Instruction/Single-Data Stream (MISD)
A computer that exploits multiple instruction streams against a single data stream.
Historical significance. This has few applications. Not covered in 61C.

Flynn's Taxonomy, 1966
SIMD and MIMD are currently the most common forms of parallelism in architectures, usually both in the same system!
Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
  Single program that runs on all processors of a MIMD
  Cross-processor execution coordination using synchronization primitives

SIMD: Single Instruction Multiple Data

SIMD Applications & Implementations
Applications:
  Scientific computing: Matlab, NumPy
  Graphics and video processing: Photoshop, ...
  Big Data: deep learning
  Gaming
Implementations:
  x86
  ARM
  RISC-V vector extensions


First SIMD Extensions: MIT Lincoln Labs TX-2, 1957

x86 SIMD Evolution
New instructions
New, wider, more registers
More parallelism

Laptop CPU Specs
$ sysctl -a | grep cpu
hw.physicalcpu: 2
hw.logicalcpu: 4
machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557u CPU @ 3.10GHz
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2apic MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS

SIMD Registers
[Figure not preserved]

SIMD Data Types
[Figure not preserved]

SIMD Vector Mode
[Figure not preserved]
(Now also AVX-512 available)


Problem
Today's compilers (largely) do not generate SIMD code
Back to assembly ...
  x86 (not using RISC-V as no vector hardware yet)
  Over 1000 instructions to learn ...
  Green Book
Can we use the compiler to generate all non-SIMD instructions?

x86 Intrinsics AVX Data Types
Intrinsics: direct access to registers & assembly from C
[Table: register types and the matching intrinsics; not preserved]

Intrinsics AVX Code Nomenclature
[Table not preserved]

x86 SIMD "Intrinsics"
[Figure: an intrinsic and its assembly instruction, which performs 4 parallel multiplies]

Raw Double-Precision Throughput
Characteristic                          Value
CPU                                     i7-5557u
Clock rate (sustained)                  3.1 GHz
Instructions per clock (mul_pd)         2
Parallel multiplies per instruction     4
Peak double FLOPS                       24.8 GFLOPS
2 instructions per clock cycle (CPI = 0.5)
Actual performance is lower because of overhead
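The peak figure in the table is the product of the three rows above it. Written out (the unit labels are added here for readability and are not on the slide):

```latex
\text{Peak}
  = 3.1 \times 10^{9}\ \tfrac{\text{cycles}}{\text{s}}
    \times 2\ \tfrac{\text{instructions}}{\text{cycle}}
    \times 4\ \tfrac{\text{multiplies}}{\text{instruction}}
  = 24.8 \times 10^{9}\ \tfrac{\text{flops}}{\text{s}}
  = 24.8\ \text{GFLOPS}
```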

Vectorized Matrix Multiplication
[Figure: the loop over i advances 4 elements at a time (i += 4), inside it a loop over j, and the inner loop runs over k]

"Vectorized" dgemm
[Code listing not preserved]

Performance
[Table: N vs. GFLOPS for scalar and avx; values not preserved]
avx is faster than scalar (factor not preserved), but still << theoretical 25 GFLOPS!

A trip to LA
Commercial airline:
  Get to SFO & check-in: 3 hours
  SFO→LAX: 1 hour
  Get to destination: 3 hours
  Total time: 7 hours
Supersonic aircraft:
  Get to SFO & check-in: 3 hours
  SFO→LAX: 6 min
  Get to destination: 3 hours
  Total time: 6.1 hours
Speedup:
  Flying time: S_flight = 60 / 6 = 10x
  Trip time: S_trip = 7 / 6.1 = 1.15x

Amdahl's Law
Get enhancement E for your new PC, e.g., a floating-point rocket booster
E speeds up some task (e.g., arithmetic) by factor S_E
F is the fraction of the program that uses this task
Execution time:
  T_O (no E):   (1-F) + F
  T_E (with E): (1-F) + F/S_E
  The (1-F) part sees no speedup; the F part is the sped-up section.
$\text{Speedup} = T_O / T_E = \frac{1}{(1-F) + F/S_E}$
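Amdahl's Law is short enough to try numerically. A small sketch in C; the function names `amdahl_speedup` and `amdahl_limit` are illustrative and not from the lecture:

```c
#include <assert.h>

/* Amdahl's Law: f is the fraction of execution time that is enhanced
   (0..1), s_e is the speedup of that fraction (> 0).
   Returns the overall speedup T_O / T_E. */
double amdahl_speedup(double f, double s_e) {
    assert(f >= 0.0 && f <= 1.0 && s_e > 0.0);
    return 1.0 / ((1.0 - f) + f / s_e);
}

/* Upper bound on the speedup as s_e grows without limit: 1 / (1 - f). */
double amdahl_limit(double f) {
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

For the trip to LA, the flight is 1 of the 7 hours, so `amdahl_speedup(1.0 / 7.0, 10.0)` evaluates to about 1.15, the same as 7 / 6.1.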

Big Idea: Amdahl's Law
$\text{Speedup} = T_O / T_E = \frac{1}{(1-F) + F/S_E}$
  (1-F): part not sped up
  F/S_E: part sped up
Example: The execution time of half of a program can be accelerated by a factor of 2. What is the overall program speedup?
$T_O / T_E = \frac{1}{(1-0.5) + 0.5/2} = 1.33 \ll 2$

Maximum Achievable Speed-Up
$\text{Speedup} = \frac{1}{(1-F) + F/S_E} \to \frac{1}{1-F}$ as $S_E \to \infty$
Question: What is a reasonable number of parallel processors to speed up an algorithm with F = 95% (i.e., 19/20ths can be sped up)?
a) Maximum speedup: $S_{max} = 1/(1-0.95) = 20$, but needs $S_E = \infty$!
b) Reasonable engineering compromise: equal time in sequential and parallel code:
   $(1-F) = F/S_E \Rightarrow S_E = F/(1-F) = 0.95/0.05 = 19 \approx 20$
   $S = 1/(0.05 + 0.05) = 10$

[Figure: speedup vs. number of processors]
If the portion of the program that can be parallelized is small, then the speedup is limited.
20 processors for 10x
500 processors for 19x
In this region, the sequential portion limits the performance.

Strong and Weak Scaling
Getting good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
  Strong scaling: speedup is achieved on a parallel processor without increasing the size of the problem
  Weak scaling: speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
Load balancing is another important factor: every processor should do the same amount of work
  Just one unit with twice the load of the others cuts speedup almost in half

Peer Instruction
Suppose a program spends 80% of its time in a square-root routine. How much must you speed up square root to make the program run 5 times faster?
$\text{Speedup} = \frac{1}{(1-F) + F/S_E}$
RED: 5
GREEN: 20
ORANGE: 500
None of the above

Administrivia
Midterm 2 is Tuesday during class time!
  Review Session: Saturday Annex A1
  One double-sided crib sheet; we'll provide the RISC-V card
  Bring your ID! We'll check it
  Room assignments posted on Piazza
Project 2 pre-lab due Friday
HW4 due 11/3, but advisable to finish by the midterm
Project 2-2 due 11/6, but also good to finish before MT2

Lecture 19: Thread-Level Parallel Processing

Amdahl's Law applied to dgemm
Measured dgemm performance:
  Peak: 5.5 GFLOPS
  Large matrices: 3.6 GFLOPS
Processor: 24.8 GFLOPS
Why are we not getting (close to) 25 GFLOPS?
Something else (not the floating-point ALU) is limiting performance! But what?
Possible culprits:
  Cache
  Hazards
Let's look at both!

Pipeline Hazards: dgemm
[Code listing not preserved]

Loop Unrolling
4 registers
Compiler does the unrolling
How do you verify that the generated code is actually unrolled?

Performance
[Table: N vs. GFLOPS for scalar, avx, unroll; values not preserved]
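The unrolled dgemm listing itself is not preserved, so here is only a scalar sketch of the idea, on a dot product (the inner operation of dgemm) rather than the lecture's AVX code. The loop body is repeated four times with four independent accumulators, so consecutive multiply-adds do not wait on each other's results. The name `dot_unrolled4` is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with the loop unrolled by 4.
   s0..s3 are independent accumulators, so the four multiply-adds in one
   iteration have no data dependence on one another. */
double dot_unrolled4(const double *x, const double *y, size_t n) {
    assert(x && y);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++) {          /* leftover elements when n % 4 != 0 */
        s0 += x[i] * y[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Because the partial sums are combined in a different order than in a simple loop, the floating-point result can differ in the last bits for general inputs.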

FPU versus Memory Access
How many floating-point operations does matrix multiply take?
  F = 2 x N^3 (N^3 multiplies, N^3 adds)
How many memory loads/stores?
  M = 3 x N^2 (for A, B, C)
Many more floating-point operations than memory accesses
  q = F/M = 2/3 * N
  Good, since arithmetic is faster than memory access
  Let's check the code ...

But memory is accessed repeatedly
Inner loop: q = F/M = 1! (2 loads and 2 floating-point operations)

Typical Memory Hierarchy
Levels, fastest to slowest: on-chip components (Control, Datapath, RegFile, Instr Cache, Data Cache), Second-Level Cache (SRAM), Third-Level Cache (SRAM), Main Memory (DRAM), Secondary Memory (Disk or Flash)
Speed (cycles): ½'s, 1's, 10's, 100's, 1,000,000's
Size (bytes): 100's, 10K's, M's, G's, T's
Cost/bit: highest to lowest
Where are the operands (A, B, C) stored?
What happens as N increases?
Idea: arrange that most accesses are to fast cache!

Sub-Matrix Multiplication or: Beating Amdahl's Law

Blocking
[Figure not preserved]

Memory Access Blocking
Idea: rearrange code to use values loaded in cache many times
Only "few" accesses to slow main memory (DRAM) per floating point operation
→ throughput limited by FP hardware and cache, not slow DRAM
P&H, RISC-V edition p
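The blocking idea can be sketched as a tiled version of the triple loop: work on bs x bs sub-matrices so that values brought into the cache are reused before being evicted. This is a minimal scalar sketch, assuming row-major storage and a caller-chosen tile size `bs`; it is not the P&H or lecture listing, and `dgemm_blocked` and `min_sz` are illustrative names:

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for n x n row-major matrices, processed in bs x bs tiles.
   The caller must zero C first to obtain C = A * B.
   bs is a tuning parameter: three tiles should fit in the cache. */
void dgemm_blocked(size_t n, size_t bs,
                   const double *a, const double *b, double *c) {
    assert(n > 0 && bs > 0 && a && b && c);
    for (size_t ii = 0; ii < n; ii += bs) {
        for (size_t kk = 0; kk < n; kk += bs) {
            for (size_t jj = 0; jj < n; jj += bs) {
                /* Multiply one tile of A by one tile of B. */
                for (size_t i = ii; i < min_sz(ii + bs, n); i++) {
                    for (size_t k = kk; k < min_sz(kk + bs, n); k++) {
                        double aik = a[i * n + k];   /* reused across j */
                        for (size_t j = jj; j < min_sz(jj + bs, n); j++) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The arithmetic is unchanged at 2*N^3 flops; only the order of memory accesses differs, which is why the gain shows up mainly for matrices too large to fit in the cache.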

Performance
[Table: N vs. GFLOPS for scalar, avx, unroll, blocking; values not preserved]

And in Conclusion, ...
Approaches to parallelism: SISD, SIMD, MIMD (next lecture)
SIMD: one instruction operates on multiple operands simultaneously
Example: matrix multiplication
  Floating point heavy → exploit Moore's law to make fast
Amdahl's Law:
  Serial sections limit speedup
  Cache: blocking
  Hazards: loop unrolling
