Advanced and parallel architectures


Surname ________________ Name ________________

Advanced and parallel architectures
Prof. A. Massini
June 11, 2015

Exercise 1a (2 points) | Exercise 1b (2 points) | Exercise 2 (5 points) | Exercise 3 (3 points) | Exercise 4a (3 points) | Exercise 4b (2 points) | Exercise 5 (4 points) | Exercise 6 (5 points) | Question 1 (3 points) | Question 2 (3 points) | Total (32 points)

Exercise 1 (2+2 points) - Loop Dependences

a) Explain what output dependences and antidependences in a loop are.

b) In the following loop, find all the output dependences and antidependences, then eliminate the output dependences and antidependences by renaming.

for (i = 0; i < 100; i++) {
    X[i] = X[i] / Z[i];  /* S1 */
    Z[i] = X[i] - k;     /* S2 */
    X[i] = W[i] + k;     /* S3 */
    W[i] = Y[i] / X[i];  /* S4 */
}

Exercise 2 (5 points) - Data Flow Machine

Consider the computation of the determinant det of a matrix A = (a_ij) of size 3x3 on a Dataflow Machine. Write the instructions needed to obtain det according to the Dataflow Machine format, group the instructions according to the parallel steps, and draw the diagram for the execution.

Exercise 3 (3 points) - GPU & CUDA

A CUDA device's SM (Streaming Multiprocessor) can take up to 2048 resident threads and up to 8 thread blocks. Which of the following block configurations would result in the largest number of threads in the SM?

(A) 128 threads per block
(B) 256 threads per block
(C) 512 threads per block
(D) 1024 threads per block

Give a comment for each answer.

Exercise 4 (3+2 points) - GPU & CUDA

You need to write a kernel that operates on a matrix of size 500x800. You would like to assign one thread to each matrix element, and you would like your thread blocks to use the maximum number of threads per block possible on your device, which has compute capability 1.3, assuming bi-dimensional blocks.

a) How would you select the grid dimensions and block dimensions of your kernel?
b) How many idle threads do you expect to have?

Technical specifications by compute capability (version):
  Maximum dimensionality of grid of thread blocks: 2 (1.0-1.3), 3 (2.x-5.2)
  Maximum x-dimension of a grid of thread blocks: 65535 (1.0-2.x), 2^31 - 1 (3.0-5.2)
  Maximum y- or z-dimension of a grid of thread blocks: 65535
  Maximum dimensionality of thread block: 3
  Maximum x- or y-dimension of a block: 512 (1.0-1.3), 1024 (2.x-5.2)
  Maximum z-dimension of a block: 64
  Maximum number of threads per block: 512 (1.0-1.3), 1024 (2.x-5.2)
  Warp size: 32
  Maximum number of resident blocks per multiprocessor: 8 (1.0-2.x), 16 (3.x), 32 (5.x)
  Maximum number of resident warps per multiprocessor: 24 (1.0-1.1), 32 (1.2-1.3), 48 (2.x), 64 (3.0-5.2)
  Maximum number of resident threads per multiprocessor: 768 (1.0-1.1), 1024 (1.2-1.3), 1536 (2.x), 2048 (3.0-5.2)

Exercise 5 (4 points) - Snooping protocol for cache coherence

Consider a multicore multiprocessor implemented as a symmetric shared-memory architecture, as illustrated in the figure. Each processor has a single, private cache, with coherence maintained using the snooping coherence protocol. Each cache is direct-mapped, with four blocks, each holding two words. The coherence states are denoted M, S, and I (Modified, Shared, and Invalid).

P0:
  Block  Coherency state  Address tag  Data
  B0     I                100          00 08
  B1     S                104          00 04
  B2     M                108          00 36
  B3     I                112          00 08

P1:
  Block  Coherency state  Address tag  Data
  B0     I                100          00 08
  B1     M                120          00 56
  B2     I                108          00 08
  B3     S                112          00 16

P3:
  Block  Coherency state  Address tag  Data
  B0     S                116          00 20
  B1     S                104          00 04
  B2     I                108          00 36
  B3     I                112          00 08

On-chip interconnect (with coherency manager)

Memory:
  Address  Data
  100      00 08
  104      00 04
  108      00 08
  112      00 16
  116      00 20
  120      00 28
  124      00 36

Each part of this exercise specifies a sequence of one or more CPU operations of the form:

  P#: <op> <address> [<value>]

where P# designates the CPU (e.g., P0), <op> is the CPU operation (read or write), <address> denotes the memory address, and <value> indicates the new word to be assigned on a write operation. For each part of this exercise, assume the initial cache and memory state as illustrated in the figure, and treat each action as independently applied to that initial state. Show in the results table:
- miss/hit
- the coherence state before the action
- the CPU Pi and cache block Bj involved
- the changed state (i.e., coherence state, tags, and data) of the caches and memory after the given action.
Specify the value returned by a read operation.

a) P0: write 104 24
   hit/miss: ___  state before: ___  Pi.Bj: ___  (state, tag, data words): ___

b) P1: read 108
   hit/miss: ___  state before: ___  Pi.Bj: ___  (state, tag, data words): ___
   value returned by read: ___

c) P0: write 124 12
   hit/miss: ___  state before: ___  Pi.Bj: ___  (state, tag, data words): ___

d) P0: write 116 32
   hit/miss: ___  state before: ___  Pi.Bj: ___  (state, tag, data words): ___

Comments: ___

Exercise 6 (5 points) - Performance equation & Amdahl's Law

Consider a machine having a clock rate of 400 MHz. The following measurements are recorded for the different instruction classes of the instruction set, running a given set of benchmark programs:

  Instruction Type       Instruction Count (millions)  Cycles per Instruction
  Arithmetic and logic   6                             1
  Load and store         4                             3
  Branch                 1                             4
  Others                 5                             3

a) Determine the effective CPI and the execution time. (2 points)

b) Knowing that we can express the MIPS rate in terms of the clock rate and CPI as follows:

  MIPS rate = IC / (T x 10^6) = f / (CPI x 10^6)

determine the MIPS rate for the machine above. (1 point)

c) Assume that load and store instructions can be modified so that they take 1 cycle per instruction instead of 3 as in the table. Compute the speedup obtained by introducing this enhancement, using Amdahl's Law. (2 points)

Question 1 (3 points) Describe the Looping algorithm for the Benes network.

Question 2 (3 points) Describe the directory-based protocol for cache coherence.