CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions.


CS 61C: Great Ideas in Computer Architecture
The Flynn Taxonomy, Intel SIMD Instructions
Instructor: Justin Hsia

Review of Last Lecture
- Amdahl's Law limits the benefits of parallelization
- Request Level Parallelism: handle multiple requests in parallel (e.g. web search)
- MapReduce Data Level Parallelism: a framework to divide up data to be processed in parallel
  - Mapper outputs intermediate key-value pairs
  - Reducer combines intermediate values that share the same key

Great Idea #4: Parallelism
Leverage parallelism to achieve high performance.
- Software: Parallel Requests, assigned to a computer (e.g. search "Garcia")
- Parallel Threads, assigned to a core (e.g. lookup, ads)
- Parallel Instructions: more than one instruction at a time (e.g. 5 pipelined instructions)
- Parallel Data: more than one data item at a time (e.g. an add of 4 pairs of words) (we are here)
- Hardware descriptions: all gates functioning in parallel at the same time
[Figure: the accompanying hardware stack, from Warehouse Scale Computer down to Computer, Core (Instruction Unit(s) and Functional Unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3), Cache/Memory, Input/Output, Smart Phone, and Logic Gates]

Hardware vs. Software Parallelism
- The choices of hardware and software parallelism are independent
  - Concurrent software can also run on serial hardware
  - Sequential software can also run on parallel hardware
- Flynn's Taxonomy is for parallel hardware

Flynn's Taxonomy
- SIMD and MIMD are the most commonly encountered today
- Most common parallel processing programming style: Single Program Multiple Data (SPMD)
  - A single program runs on all processors of an MIMD machine
  - Cross-processor execution is coordinated through conditional expressions (we will see this later in Thread Level Parallelism); a small sketch follows at the end of this section
- SIMD: specialized function units (hardware) for handling lock-step calculations involving arrays
  - Used in scientific computing, signal processing, and multimedia (audio/video processing)
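The SPMD bullet above is easiest to see in code. Below is a minimal, hedged sketch using OpenMP (which this course introduces later); the runtime calls are standard OpenMP, but the particular division of work is made up purely for illustration.

    #include <stdio.h>
    #include <omp.h>

    // SPMD sketch: every thread executes this same program text, and
    // conditionals on the thread ID steer different threads to different work.
    int main(void) {
        #pragma omp parallel
        {
            int me = omp_get_thread_num();        // which copy of the program am I?
            int total = omp_get_num_threads();    // how many copies are running?
            if (me == 0)
                printf("thread 0 of %d: coordinating\n", total);   // illustrative role only
            else
                printf("thread %d of %d: working on its share of the data\n", me, total);
        }
        return 0;
    }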

Single Instruction/Single Data Stream (SISD)
- A sequential computer that exploits no parallelism in either the instruction or data streams
- Examples of SISD architecture are traditional uniprocessor machines

Multiple Instruction/Single Data Stream (MISD)
- Exploits multiple instruction streams against a single data stream, for data operations that can be naturally parallelized (e.g. certain kinds of array processors)
- MISD is no longer commonly encountered and is mainly of historical interest

Single Instruction/Multiple Data Stream (SIMD)
- A computer that applies a single instruction stream to multiple data streams, for operations that may be naturally parallelized (e.g. SIMD instruction extensions or a Graphics Processing Unit)

Multiple Instruction/Multiple Data Stream (MIMD)
- Multiple autonomous processors simultaneously executing different instructions on different data
- MIMD architectures include multicore processors and Warehouse Scale Computers

Administrivia
- HW3 due Sunday
- Proj2 (MapReduce) to be released soon
  - Part 1 due 3/17, Part 2 due 3/24
  - Work in partners, preferably with at least one partner who knows Java
- Midterms graded: collect after lecture today or from a Lab TA next week

SIMD Architectures
- Data Level Parallelism (DLP): executing one operation on multiple data streams
- Example: multiplying a coefficient vector by a data vector (e.g. in filtering)
      y[i] := c[i] × x[i],  0 ≤ i < n
- Sources of performance improvement:
  - One instruction is fetched and decoded for the entire operation
  - The multiplications are known to be independent
  - Pipelining/concurrency in memory access as well

Advanced Digital Media Boost
- To improve performance, Intel's SIMD instructions fetch one instruction and do the work of multiple instructions
  - MMX (MultiMedia eXtension, Pentium II processor family)
  - SSE (Streaming SIMD Extension, Pentium III and beyond)

Example: SIMD Array Processing
Pseudocode:
    for each f in array
        f = sqrt(f)

SISD version:
    for each f in array {
        load f to the floating point register
        calculate the square root
        write the result from the register to memory
    }

SIMD version:
    for every 4 members in array {
        load 4 members to the SSE register
        calculate 4 square roots in one operation
        write the result from the register to memory
    }
(An SSE intrinsics sketch of this example appears at the end of this section.)

SSE Instruction Categories for Multimedia Support
- Intel processors are CISC (complicated instructions)
- SSE2+ supports wider data types to allow sixteen 8-bit and eight 16-bit operands
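For concreteness, here is a hedged C sketch of the SIMD version of the square-root example above, written with SSE intrinsics from <xmmintrin.h>. The function name sqrt_simd and the assumptions (n is a multiple of 4, f is 16-byte aligned) are ours, not the slides'.

    #include <xmmintrin.h>   // SSE intrinsics

    // Process the array 4 floats at a time, mirroring the SIMD pseudocode above.
    // Assumes n % 4 == 0 and that f is 16-byte aligned (required by _mm_load_ps).
    void sqrt_simd(float *f, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(&f[i]);   // load 4 members into an XMM register
            v = _mm_sqrt_ps(v);              // 4 square roots in one operation
            _mm_store_ps(&f[i], v);          // write 4 results back to memory
        }
    }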

Intel Architecture SSE2+ 128-Bit SIMD Data Types
[Figure: one 128-bit register packed four ways: sixteen 8-bit operands (16 / 128 bits), eight 16-bit operands (8 / 128 bits), four 32-bit operands (4 / 128 bits), or two 64-bit operands (2 / 128 bits)]
- Note: in Intel Architecture (unlike MIPS) a word is 16 bits
  - Single precision FP: double word (32 bits)
  - Double precision FP: quad word (64 bits)

XMM Registers
- Architecture extended with eight 128-bit data registers (XMM0 through XMM7)
- The 64-bit address architecture makes sixteen 128-bit registers available (adding XMM8 through XMM15)
- e.g. the 128-bit packed single-precision floating-point data type (four doublewords) allows four single-precision operations to be performed simultaneously

SSE/SSE2 Floating Point Instructions
- {SS} Scalar Single precision FP: one 32-bit operand in a 128-bit register
- {PS} Packed Single precision FP: four 32-bit operands in a 128-bit register
- {SD} Scalar Double precision FP: one 64-bit operand in a 128-bit register
- {PD} Packed Double precision FP: two 64-bit operands in a 128-bit register
- xmm: one operand is a 128-bit SSE2 register
- mem/xmm: the other operand is in memory or an SSE2 register
- {A} the 128-bit operand is aligned in memory
- {U} the 128-bit operand is unaligned in memory
- {H} move the high half of the 128-bit operand
- {L} move the low half of the 128-bit operand

Example: Add Single Precision FP Vectors
Computation to be performed:
    vec_res.x = v1.x + v2.x;
    vec_res.y = v1.y + v2.y;
    vec_res.z = v1.z + v2.z;
    vec_res.w = v1.w + v2.w;

SSE instruction sequence:
    movaps address-of-v1, %xmm0
        // move from mem to XMM register, memory aligned, packed single precision
        // v1.w v1.z v1.y v1.x -> xmm0
    addps address-of-v2, %xmm0
        // add from mem to XMM register, packed single precision
        // v1.w+v2.w v1.z+v2.z v1.y+v2.y v1.x+v2.x -> xmm0
    movaps %xmm0, address-of-vec_res
        // move from XMM register to mem, memory aligned, packed single precision
(A C intrinsics version of this sequence appears at the end of this section.)

Packed and Scalar Double Precision Floating Point Operations
[Figure: Packed Double (PD) operations act on both 64-bit halves of the 128-bit register; Scalar Double (SD) operations act on the low 64 bits only]
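As a bridge to the assembly above, here is a hedged C sketch of the same vector add using SSE intrinsics. The vec4 struct and the use of the unaligned load/store intrinsics are our assumptions (the movaps in the slide additionally requires the data to be 16-byte aligned).

    #include <xmmintrin.h>

    // Four contiguous floats, so one 128-bit load picks up the whole vector.
    typedef struct { float x, y, z, w; } vec4;

    void vec_add(const vec4 *v1, const vec4 *v2, vec4 *vec_res) {
        __m128 a = _mm_loadu_ps(&v1->x);       // like movaps: v1 -> XMM register
        __m128 b = _mm_loadu_ps(&v2->x);
        __m128 sum = _mm_add_ps(a, b);         // like addps: packed single-precision add
        _mm_storeu_ps(&vec_res->x, sum);       // like movaps: XMM register -> vec_res
    }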

Example: Image Converter (1/5)
- Converts a BMP (bitmap) image to a YUV (color space) image format:
  - Read individual pixels from the BMP image, convert pixels into YUV format
  - Can pack the pixels and operate on a set of pixels with a single instruction
- A bitmap image consists of 8-bit monochrome pixels
  - By packing these pixel values in a 128-bit register, we can operate on 128/8 = 16 values at a time: a significant performance boost

Example: Image Converter (2/5)
- FMADDPS: multiply-and-add packed single-precision floating-point instruction (a CISC instruction!)
- One of the typical operations computed in transformations (e.g. DFT or FFT):
      P = Σ (n = 1 to N) f(n) × x(n)

Example: Image Converter (3/5)
FP numbers f(n) and x(n) are in src1 and src2; p is in dest. C implementation for N = 4 (128 bits):
    for (int i = 0; i < 4; i++)
        p = p + src1[i] * src2[i];
1) Regular x86 instructions for the inner loop:
    fmul [ ]
    faddp [ ]
Instructions executed: 4 * 2 = 8 (x86)

Example: Image Converter (4/5)
FP numbers f(n) and x(n) are in src1 and src2; p is in dest. C implementation for N = 4 (128 bits):
    for (int i = 0; i < 4; i++)
        p = p + src1[i] * src2[i];
2) SSE2 instructions for the inner loop:
    // xmm0 = p, xmm1 = src1[i], xmm2 = src2[i]
    mulps %xmm1,%xmm2   // xmm2 * xmm1 -> xmm2
    addps %xmm2,%xmm0   // xmm0 + xmm2 -> xmm0
Instructions executed: 2 (SSE2)
(A C intrinsics sketch of this multiply-accumulate appears at the end of this section.)

Example: Image Converter (5/5)
FP numbers f(n) and x(n) are in src1 and src2; p is in dest. C implementation for N = 4 (128 bits):
    for (int i = 0; i < 4; i++)
        p = p + src1[i] * src2[i];
3) SSE5 accomplishes the same in one instruction:
    fmaddps %xmm0, %xmm1, %xmm2, %xmm0
    // xmm2 * xmm1 + xmm0 -> xmm0
    // multiply xmm1 x xmm2 packed single,
    // then add the product packed single to the sum in xmm0

Summary
- Flynn Taxonomy of parallel architectures
  - SIMD: Single Instruction Multiple Data
  - MIMD: Multiple Instruction Multiple Data
  - SISD: Single Instruction Single Data
  - MISD: Multiple Instruction Single Data (unused)
- Intel SSE SIMD instructions
  - One instruction fetch that operates on multiple operands simultaneously
  - 128/64-bit XMM registers
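Here is a hedged C intrinsics sketch of the SSE2 approach from Image Converter (4/5). One detail the slide leaves implicit: mulps and addps operate element-wise, so the register ends up holding four partial sums that still need a final horizontal add to produce the single scalar p. The function name dot4 and the explicit scalar reduction are our additions.

    #include <xmmintrin.h>

    // Multiply four pairs of floats in one instruction, then reduce to a scalar.
    float dot4(const float *src1, const float *src2) {
        __m128 prod = _mm_mul_ps(_mm_loadu_ps(src1), _mm_loadu_ps(src2));  // like mulps
        float partial[4];
        _mm_storeu_ps(partial, prod);                                      // spill the 4 products
        return partial[0] + partial[1] + partial[2] + partial[3];          // horizontal sum -> p
    }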

You are responsible for the material contained on the following slides, though we may not have enough time to get to them in lecture. They have been prepared in a way that should be easily readable, and the material will be touched upon in the following lecture.

Data Level Parallelism and SIMD
- SIMD wants adjacent values in memory that can be operated on in parallel
- These are usually specified in programs as loops:
      for(i=0; i<1000; i++)
          x[i] = x[i] + s;
- How can we reveal more data-level parallelism than is available in a single iteration of a loop?
  - Unroll the loop and adjust the iteration rate

Looping in MIPS
Assumptions:
    $s0: initial address (beginning of array)
    $s1: scalar value s
    $s2: termination address (end of array)

    Loop: lw    $t0,0($s0)
          addu  $t0,$t0,$s1     # add s to array element
          sw    $t0,0($s0)      # store result
          addiu $s0,$s0,4       # move to next element
          bne   $s0,$s2,Loop    # repeat Loop if not done

Loop Unrolled
    Loop: lw    $t0,0($s0)
          addu  $t0,$t0,$s1
          sw    $t0,0($s0)
          lw    $t1,4($s0)
          addu  $t1,$t1,$s1
          sw    $t1,4($s0)
          lw    $t2,8($s0)
          addu  $t2,$t2,$s1
          sw    $t2,8($s0)
          lw    $t3,12($s0)
          addu  $t3,$t3,$s1
          sw    $t3,12($s0)
          addiu $s0,$s0,16
          bne   $s0,$s2,Loop
NOTE:
1. Using different registers eliminates stalls
2. Loop overhead is encountered only once every 4 data iterations
3. This unrolling works only if loop_limit mod 4 = 0

Loop Unrolled Scheduled
(Note: we just switched from integer instructions to single precision FP instructions!)
    Loop: lwc1  $t0,0($s0)
          lwc1  $t1,4($s0)
          lwc1  $t2,8($s0)
          lwc1  $t3,12($s0)     # 4 loads side by side: could replace with 4-wide SIMD load
          add.s $t0,$t0,$s1
          add.s $t1,$t1,$s1
          add.s $t2,$t2,$s1
          add.s $t3,$t3,$s1     # 4 adds side by side: could replace with 4-wide SIMD add
          swc1  $t0,0($s0)
          swc1  $t1,4($s0)
          swc1  $t2,8($s0)
          swc1  $t3,12($s0)     # 4 stores side by side: could replace with 4-wide SIMD store
          addiu $s0,$s0,16
          bne   $s0,$s2,Loop
(A C intrinsics sketch of that 4-wide SIMD replacement appears at the end of this section.)
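The scheduled loop above hints that the four parallel loads, adds, and stores could each become one 4-wide SIMD instruction. Below is a hedged C sketch of that idea using SSE intrinsics; the function name add_s_simd and the assumption that the element count is a multiple of 4 are ours.

    #include <xmmintrin.h>

    // x[i] = x[i] + s, computed 4 elements per iteration.
    // Assumes n % 4 == 0; unaligned load/store intrinsics are used so x
    // does not have to be 16-byte aligned.
    void add_s_simd(float *x, float s, int n) {
        __m128 vs = _mm_set1_ps(s);              // broadcast s into all 4 lanes
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_loadu_ps(&x[i]);      // 4-wide SIMD load
            v = _mm_add_ps(v, vs);               // 4-wide SIMD add
            _mm_storeu_ps(&x[i], v);             // 4-wide SIMD store
        }
    }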

Loop Unrolling in C
Instead of the compiler doing the loop unrolling, you could do it yourself in C:
    for(i=0; i<1000; i++)
        x[i] = x[i] + s;

Loop unrolled:
    for(i=0; i<1000; i=i+4) {
        x[i]   = x[i]   + s;
        x[i+1] = x[i+1] + s;
        x[i+2] = x[i+2] + s;
        x[i+3] = x[i+3] + s;
    }
What is the downside of doing this in C?

Generalizing Loop Unrolling
Take a loop of n iterations and perform a k-fold unrolling of the body of the loop:
- First run the loop with k copies of the body floor(n/k) times
- To finish the leftovers, then run the loop with 1 copy of the body n mod k times
(We will revisit loop unrolling when we get to pipelining later in the semester.)
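A hedged sketch of this recipe, written out in C for k = 4 (the function name and signature are ours): the first loop runs the unrolled body floor(n/4) times, and the cleanup loop runs n mod 4 times, which removes the loop_limit mod 4 = 0 restriction noted earlier.

    // k-fold unrolling for k = 4, with a cleanup loop for the leftover iterations.
    void add_scalar_unrolled(float *x, float s, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {   // floor(n/4) passes over the unrolled body
            x[i]   = x[i]   + s;
            x[i+1] = x[i+1] + s;
            x[i+2] = x[i+2] + s;
            x[i+3] = x[i+3] + s;
        }
        for (; i < n; i++)             // n mod 4 leftover iterations
            x[i] = x[i] + s;
    }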