Chapter 4 Data-Level Parallelism


CS359: Computer Architecture Chapter 4 Data-Level Parallelism Yanyan Shen Department of Computer Science and Engineering Shanghai Jiao Tong University

Outline
4.1 Introduction
4.2 Vector Architecture
4.3 SIMD Instruction Set Extensions
4.4 GPU

Flynn's Classification (1966)
- SISD (Single Instruction, Single Data): conventional uniprocessor
- SIMD (Single Instruction, Multiple Data): one instruction stream, multiple data paths
  - Distributed-memory SIMD (MPP, DAP, CM-1&2, MasPar)
  - Shared-memory SIMD (STARAN, vector computers)
- MIMD (Multiple Instruction, Multiple Data)
  - Message-passing machines (Transputers, nCUBE, CM-5)
  - Non-cache-coherent shared-memory machines (BBN Butterfly, T )
  - Cache-coherent shared-memory machines (Sequent, Sun Starfire, SGI Origin)
- MISD (Multiple Instruction, Single Data): not a practical configuration

Approaches to Exploiting Parallelism
- Instruction-Level Parallelism (ILP): execute independent instructions from one instruction stream in parallel (pipelining, superscalar, VLIW)
- Data-Level Parallelism (DLP): execute multiple operations of the same type in parallel (vector/SIMD execution)
- Thread-Level Parallelism (TLP): execute independent instruction streams in parallel (multithreading, multiple cores)

Resurgence of DLP
The convergence of application demands and technology constraints drives architecture choice. New applications, such as graphics, machine vision, speech recognition, and machine learning, all require large numerical computations that are often trivially data parallel. SIMD-based architectures (vector-SIMD, subword-SIMD, GPUs) are the most efficient way to execute these algorithms.

Introduction
SIMD architectures can exploit significant data-level parallelism for:
- matrix-oriented scientific computing
- media-oriented image and sound processors
SIMD is more energy efficient than MIMD: it only needs to fetch one instruction for a set of data operations, which makes SIMD attractive for personal mobile devices.
Advantage compared to MIMD: SIMD allows the programmer to continue to think sequentially.

SIMD
Single Instruction, Multiple Data architectures make use of data parallelism. We care about SIMD because of area and power efficiency concerns: control overhead is amortized over the SIMD width. However, the parallelism is exposed to the programmer and compiler.

SIMD Parallelism
1) Vector architectures
2) SIMD extensions
3) Graphics Processing Units (GPUs)

Outline
4.1 Introduction
4.2 Vector Architecture
4.3 SIMD Instruction Set Extensions
4.4 GPU

Vector Architectures
Basic idea:
- Read sets of data elements into vector registers
- Operate on those registers
- Disperse the results back into memory
The registers are controlled by the compiler, and are used to hide memory latency and leverage memory bandwidth.

Vector Supercomputers
Epitomized by the Cray-1 (1976), designed by Seymour Roger Cray, the "father of supercomputers":
- Scalar unit + vector extensions
- Load/store architecture
- Vector registers, vector instructions
- Hardwired control
- Highly pipelined functional units
- Interleaved memory system
- No data caches, no virtual memory
The initial model weighed 5.5 tons including the Freon refrigeration system.

Cray-1 (1976)

T0 (Torrent) Vector Microprocessor (1995)

Vector Supercomputer: NEC SX-6 (2003)

Earth System Simulator (NEC, ~2002)

Basic structure of VMIPS
This processor has a scalar architecture just like MIPS. There are also eight 64-element vector registers and a set of vector functional units. The vector and scalar registers have a significant number of read and write ports to allow multiple simultaneous vector operations. A set of crossbar switches (thick gray lines) connects these ports to the inputs and outputs of the vector functional units.

Architecture: VMIPS (running example)
Loosely based on the Cray-1.
- Vector registers: each register holds a 64-element vector with 64 bits/element; the register file has 16 read ports and 8 write ports
- Vector functional units: fully pipelined; data and control hazards are detected
- Vector load-store unit: fully pipelined; one word per clock cycle after the initial latency
- Scalar registers: 32 general-purpose registers and 32 floating-point registers

Vector Programming Model

Vector Instructions
- ADDVV.D: add two vectors
- ADDVS.D: add vector to a scalar
- LV/SV: vector load and vector store from address

Vector Instruction Set Advantages
- Compact: one short instruction encodes N operations
- Expressive: tells the hardware that these N operations
  - are independent
  - use the same functional unit
  - access disjoint registers (elements)
  - access registers in the same pattern as previous instructions
  - access a contiguous block of memory (unit-stride load/store)
  - or access memory in a known pattern (strided load/store)
- Scalable: the same object code can run on more parallel pipelines, or lanes; i.e., the code is machine independent

Example: DAXPY
DAXPY: Double-precision a*X plus Y, 64 elements in total.

How Vector Processors Work: An Example
MIPS code:

How Vector Processors Work: An Example
VMIPS code: 600 instructions -> 6 instructions
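The six-instruction VMIPS sequence for DAXPY, following the Hennessy & Patterson running example on which this deck is based (register assignments are illustrative), is roughly:

```
L.D     F0,a        ; load scalar a
LV      V1,Rx       ; load vector X
MULVS.D V2,V1,F0    ; vector-scalar multiply
LV      V3,Ry       ; load vector Y
ADDVV.D V4,V2,V3    ; add the two vectors
SV      V4,Ry       ; store the result back into Y
```

The roughly 600 MIPS instructions collapse to 6 because each vector instruction covers all 64 elements, and the per-iteration loop overhead (increment, compare, branch) disappears.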

Vector Arithmetic Execution
Note: the hardware is not always parallel (it may have a single lane); a deep pipeline also helps!

Vector Execution Time
The execution time of a sequence of vector operations depends on three factors:
- length of the operand vectors
- structural hazards
- data dependencies
VMIPS functional units consume one element per clock cycle, so execution time is approximately the vector length.

Vector Chaining

Vector Chaining Advantage

Optimizations
How can a vector processor execute a single vector faster than one element per clock cycle? Processing multiple elements per clock cycle improves performance.

Vector Instruction Execution

Vector Unit Structure: Multiple Lanes

T0 (Torrent) Vector Microprocessor (1995)

Optimizations (Cont'd)
How do you program a vector computer?

Automatic Code Vectorization

Programming Vector Architectures
- Compilers can provide feedback to programmers
- Programmers can provide hints to the compiler

Optimizations (Cont'd)
How does a vector processor handle programs where the vector lengths are not the same as the length of the vector register (64 for VMIPS)? Most application vectors do not match the architecture's vector length.

Vector Length Register
Vector length not known at compile time? Use the Vector Length Register (VLR) to control the length of any vector operation, including a vector load and store.

Vector Stripmining