CS/CoE 1541 Final Exam (Fall 2017)

Name: ________________________

This is the cumulative final exam given in the Fall of 2017.

Question 1 (12 points): was on Chapter 4.
Question 2 (13 points): was on Chapter 4.

For Exam 2, you need to review Questions 3, 4 and 5a from this exam, in addition to Questions 6 and 7 from last year's mid-term that was posted earlier. Questions 5b, 6, 7 and 8 of this exam relate to material that we have not covered yet; you should get back to them when you review for Exam III.

Question 3: (10 points)

Consider a computer system that supports a 16MB virtual address space (byte addressable) with 1KB pages, 256KB of physical memory, and an 8-entry direct-mapped TLB.

a) Assume that the TLB configuration is as follows:

   Valid   Tag             Page address
   1       0000 0011 000   0010 0000
   1       0000 0011 000   0000 0100
   1       1111 0011 000   0000 1110
   1       0011 0001 101   1111 0001
   1       0011 1110 100   1010 0011
   1       1010 1010 110   0101 0001
   0       0000 0011 000   0101 0010
   1       1111 0011 000   1100 0101

For each of the following virtual addresses, specify the virtual page number and find the physical page number. If any address causes a TLB miss, indicate so (do not find the physical address in such a case).

   Virtual address (byte address)    Virtual page number    Physical page number
   0000 0011 0000 0100 0001 1001
   1111 0011 0001 1111 1100 0100
   0011 0001 1011 1110 1001 1101

b) A TLB miss may or may not result in a page fault. When does a TLB miss result in a page fault?

c) What would be the main advantage and main disadvantage of replacing the direct-mapped TLB with a fully associative TLB?

Advantage:

Disadvantage:
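Reference note (not part of the original exam): the parameters above imply a 24-bit virtual address split into a 10-bit page offset and a 14-bit virtual page number, with the low 3 bits of the page number indexing the 8-entry direct-mapped TLB and the remaining 11 bits forming the tag. A minimal C sketch of that breakdown (the address value and variable names are my own, chosen only for illustration):

   #include <stdio.h>

   int main(void) {
       unsigned int va     = 0xABCDEF;   /* a hypothetical 24-bit virtual address        */
       unsigned int offset = va & 0x3FF; /* low 10 bits: offset within a 1KB page        */
       unsigned int vpn    = va >> 10;   /* remaining 14 bits: virtual page number       */
       unsigned int index  = vpn & 0x7;  /* low 3 bits of the VPN index the 8-entry TLB  */
       unsigned int tag    = vpn >> 3;   /* upper 11 bits of the VPN form the TLB tag    */

       printf("VPN = 0x%X, TLB index = %u, TLB tag = 0x%X, page offset = 0x%X\n",
              vpn, index, tag, offset);
       return 0;
   }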

Question 4 (15 points):

Consider a 2GHz processor with separate instruction and data caches. We are focusing on improving the data cache performance, assuming that the instruction cache achieves a 100% hit rate while our data cache achieves only a 90% hit rate. The processor is one in which, on average, an instruction takes 3.5 cycles to execute if the cache hit rate is 100% (that is, CPI = 3.5).

(a) If the memory access latency is 40 ns, what is the data cache miss penalty (in cycles)?

(b) Assuming that the cache miss penalty is x cycles (the number computed in part a), and that 25% of all instructions are memory instructions, write the formula that you would use to compute the CPI (write the formula in terms of x; do not evaluate it).

(c) To improve the overall performance, we decided to introduce a second level cache (L2). Its access latency is five cycles. The L2 cache hit rate is measured to be 50% (i.e., the probability of finding a block in L2 when it misses in L1). Write the formula that you would use to compute the CPI in this case (keep the formula in terms of x; do not evaluate it).

(d) If the memory address (byte addressable) consists of 32 bits, b31, ..., b0, and the data cache is a 64KB, 4-way set-associative cache with a 64B block size, specify the bits that should be used to index the cache and those that should be used as the tag.

   Bits for index are: b__, ..., b__
   Bits for tag are:   b__, ..., b__
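Reference note (not part of the original exam, and not the official solution): one common textbook way to write the quantities asked for in parts (a) to (c), with the stated parameters plugged in as C expressions. The part (c) accounting (the L2 latency is paid on every L1 miss, and only the L2 misses still pay the full memory penalty x) is an assumption about the intended model:

   #include <stdio.h>

   int main(void) {
       double clock_ghz  = 2.0;    /* 2 GHz processor                              */
       double mem_ns     = 40.0;   /* memory access latency in nanoseconds         */
       double base_cpi   = 3.5;    /* CPI when every access hits in the cache      */
       double mem_frac   = 0.25;   /* fraction of instructions that access memory  */
       double l1_miss    = 0.10;   /* data cache miss rate (90% hit rate)          */
       double l2_latency = 5.0;    /* L2 access latency in cycles                  */
       double l2_hit     = 0.50;   /* L2 hit rate, given an L1 miss                */

       double x = mem_ns * clock_ghz;   /* part (a): ns times cycles-per-ns gives the miss penalty in cycles */
       double cpi_b = base_cpi + mem_frac * l1_miss * x;                                   /* part (b) */
       double cpi_c = base_cpi + mem_frac * l1_miss * (l2_latency + (1.0 - l2_hit) * x);   /* part (c), under the stated assumption */

       printf("x = %.0f cycles, CPI(b) = %.3f, CPI(c) = %.3f\n", x, cpi_b, cpi_c);
       return 0;
   }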

Question 5: (10 points)

(a) Consider a quad processor system where each processor (P0, P1, P2 and P3) has a private direct-mapped cache and all cores share a single address space (shared memory system). Assume that memory address A maps to some cache block and that this block is initially invalid in all the caches. Specify the state (I = invalid, S = shared, or E = exclusive/modified) of the block in the cache of each processor after each of the following memory operations:

                 In P0    In P1    In P2    In P3
   Initially       I        I        I        I
   P1 reads A
   P2 reads A
   P3 writes A
   P0 writes A
   P1 reads A
   P0 writes A

(b) Let the distance between two nodes be defined as the number of links between the nodes. What is the diameter and what is the bisection width of each of the networks shown below? (The two network figures are not reproduced in this transcription.)

   Network 1: Diameter = ____    Bisection width = ____
   Network 2: Diameter = ____    Bisection width = ____
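Reference note (not part of the original exam): a toy model of the invalidation-based behavior that part (a) assumes, written as a short C program. The protocol here (a read fetches a shared copy and downgrades a remote exclusive copy; a write invalidates every other copy and leaves the writer exclusive) and the sample operation sequence are my own simplifications, not course code:

   #include <stdio.h>

   enum state { I, S, E };                          /* invalid, shared, exclusive/modified */
   static const char *name[] = { "I", "S", "E" };
   static enum state cache[4] = { I, I, I, I };     /* state of block A in P0..P3 */

   static void read_A(int p) {
       for (int q = 0; q < 4; q++)                  /* a remote exclusive copy is downgraded */
           if (q != p && cache[q] == E) cache[q] = S;
       if (cache[p] == I) cache[p] = S;             /* the reader obtains a shared copy */
   }

   static void write_A(int p) {
       for (int q = 0; q < 4; q++) cache[q] = I;    /* invalidate every other copy */
       cache[p] = E;                                /* the writer holds it exclusively */
   }

   static void show(const char *op) {
       printf("%-12s", op);
       for (int q = 0; q < 4; q++) printf("  %s", name[cache[q]]);
       printf("\n");
   }

   int main(void) {
       show("Initially");
       read_A(1);  show("P1 reads A");              /* a hypothetical sample sequence,   */
       write_A(2); show("P2 writes A");             /* not the sequence from part (a)    */
       read_A(0);  show("P0 reads A");
       return 0;
   }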

Question 6: (10 points)

(a) Consider the following skeleton for a CUDA program:

   __global__ void my_kernel (int *a, int *b) {
       int idx = blockIdx.x * blockDim.x + threadIdx.x;   /* note that blockDim.x = 2 */
       a[idx+1] = threadIdx.x;
       b[idx] = blockIdx.x;
   }

   void main() {
       my_kernel<<< 4, 2 >>> (a, b);   /* four blocks, each containing 2 threads */
   }

Assuming that arrays a[] and b[] are allocated in the global memory and are initialized to -1, what will be the values of their elements after my_kernel finishes execution?

   a[0]  a[1]  a[2]  a[3]  a[4]  a[5]  a[6]  a[7]  a[8]
   b[0]  b[1]  b[2]  b[3]  b[4]  b[5]  b[6]  b[7]  b[8]

(b) For each of the following statements, circle the correct answer.

- The __syncthreads() CUDA barrier synchronizes all the threads of a:
  (i) kernel    (ii) thread block    (iii) warp
- In a CUDA kernel, it is more efficient to use a block size of:
  (i) 64 threads    (ii) 16 threads    (iii) 8 threads
- In a GPU, the shared memory is shared among all the threads of a:
  (i) kernel    (ii) thread block    (iii) warp
- cudaMalloc() is used to dynamically allocate space in:
  (i) shared memory    (ii) global memory    (iii) CPU memory
- All the threads in a thread block execute on the same:
  (i) SM (streaming multiprocessor)    (ii) SP (streaming processor)    (iii) device
- The number of registers in an SM determines the maximum number of ______ allowed on the SM:
  (i) warps    (ii) thread blocks    (iii) threads
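Reference note (not part of the original exam): the skeleton omits how a[] and b[] are set up. Below is a minimal sketch of host code that allocates them in global memory with cudaMalloc, initializes them to -1, launches the kernel exactly as in the skeleton, and copies the results back. The array size of 9 and every identifier outside the exam's skeleton are my own assumptions:

   #include <cstdio>
   #include <cuda_runtime.h>

   __global__ void my_kernel(int *a, int *b) {
       int idx = blockIdx.x * blockDim.x + threadIdx.x;   /* blockDim.x = 2 */
       a[idx + 1] = threadIdx.x;
       b[idx] = blockIdx.x;
   }

   int main() {
       const int n = 9;                    /* 9 elements, to match the answer table above */
       int h_a[n], h_b[n];
       for (int i = 0; i < n; i++) h_a[i] = h_b[i] = -1;   /* initialize to -1 */

       int *d_a, *d_b;
       cudaMalloc((void **)&d_a, n * sizeof(int));         /* global-memory allocations */
       cudaMalloc((void **)&d_b, n * sizeof(int));
       cudaMemcpy(d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice);
       cudaMemcpy(d_b, h_b, n * sizeof(int), cudaMemcpyHostToDevice);

       my_kernel<<<4, 2>>>(d_a, d_b);                      /* four blocks of 2 threads */

       cudaMemcpy(h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
       cudaMemcpy(h_b, d_b, n * sizeof(int), cudaMemcpyDeviceToHost);
       for (int i = 0; i < n; i++) printf("a[%d]=%d  b[%d]=%d\n", i, h_a[i], i, h_b[i]);

       cudaFree(d_a); cudaFree(d_b);
       return 0;
   }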

Question 7: (20 points)

Consider the following Pthreads program which computes the sum of N numbers, using P threads. The sum will be computed in sum[0]:

   #include <pthread.h>

   #define P 8        /* P is a power of 2 */
   #define N 1024     /* N is a power of 2 */

   void *compute_sum (void *);

   struct arg_to_thread { int id; };

   float A[N];        /* assume that A[] is initialized to some values */
   float sum[P];

   int main (int argc, char *argv[])
   {
       int i;
       pthread_t p_threads[P];
       pthread_attr_t attr;
       struct arg_to_thread my_arg[P];

       pthread_attr_init(&attr);
       for (i = 0; i < P; i++) {
           my_arg[i].id = i;
           pthread_create(&p_threads[i], &attr, compute_sum, (void *) &my_arg[i]);
       }
       for (i = 0; i < P; i++)
           pthread_join(p_threads[i], NULL);
       return 0;
   }

   void *compute_sum (void *arg)
   {
       struct arg_to_thread *local_arg;
       int i, half, idx;

       local_arg = arg;
       idx = (*local_arg).id;                    /* idx is the id given to the thread */

       for (i = ____ ; i < ____ ; i++)           /* compute a partial sum */    /* line 1 */
           sum[____] = sum[____] + A[i];                                        /* line 2 */
       half = P/2;                                                              /* line 3 */
       for (i = ____ ; ____ ; i++) {             /* compute the global sum */   /* line 4 */
           if ( ____ ) {                                                        /* line 5 */
               sum[____] = sum[____] + sum[____];                               /* line 6 */
           }
           half = ____;                                                         /* line 7 */
       }                                                                        /* line 8 */
       return NULL;
   }

(a) Ignoring the need for synchronization, complete the lines labeled /*line 1*/ to /*line 7*/ in the function compute_sum such that the sum of the 1024 numbers is computed in sum[0]. The sum is computed by forking 8 threads, each of which computes the sum of 128 numbers (in parallel). The 8 partial sums are then added together using a tree reduction algorithm.

(b) Indicate after which line(s) you should add barrier synchronization and explain why these barriers are needed.

(c) What is the speedup and efficiency obtained from the parallel execution when N = 1024 and P = 8?

   Serial execution time = ____ steps
   Execution time using 8 processors = ____ steps
   Speedup = ____
   Efficiency = ____

(d) Assuming that you can use as many processors as you want, what is the maximum speedup that can be obtained to solve the problem for N = 1024?

(e) Amdahl's law indicates that if f is the fraction of the task that has to execute serially, then the maximum speedup that can be obtained is 1/f. Why can't we apply Amdahl's law to obtain the answer to part (d)?
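Reference note (not part of the original exam): part (b) asks where barrier synchronization is needed, but the program above does not show a barrier primitive. For reference, this is how a POSIX barrier (pthread_barrier_t) is declared and used with Pthreads; the program below is my own minimal sketch, not code from the course:

   #include <pthread.h>
   #include <stdio.h>

   #define T 4                               /* number of participating threads (example) */
   static pthread_barrier_t barrier;

   static void *worker(void *arg) {
       long id = (long) arg;
       printf("thread %ld: before barrier\n", id);
       pthread_barrier_wait(&barrier);       /* no thread passes until all T have arrived */
       printf("thread %ld: after barrier\n", id);
       return NULL;
   }

   int main(void) {
       pthread_t tid[T];
       pthread_barrier_init(&barrier, NULL, T);   /* barrier for T threads */
       for (long i = 0; i < T; i++)
           pthread_create(&tid[i], NULL, worker, (void *) i);
       for (int i = 0; i < T; i++)
           pthread_join(tid[i], NULL);
       pthread_barrier_destroy(&barrier);
       return 0;
   }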

Question 8 (10 points):

Consider a superscalar architecture with two execution units, one for load/store instructions and one for other instructions. The following two tables indicate the order of execution of two threads, A and B, and the latencies mandated by dependences between instructions. For example, A2 and A3 should execute after A1, there should be at least four cycles between the execution of A3 and A5, and at least three cycles between A4 and A5. In other words, the tables indicate the schedule when the instructions of each thread are executed with no multithreading. Note that an instruction that executes on one unit (for example A1 on the load/store unit) cannot execute on the other. When two instructions are listed for the same cycle, they occupy the two different units.

   Thread A:
   t:     A1 (load/store unit)
   t+1:   A3, A2
   t+2:   A4
   t+3 to t+5:  (no instructions issued)
   t+6:   A5
   t+7:   A6, A7

   Thread B:
   t:     B1, B2
   t+1:   B3
   t+2:   B4, B5
   t+3:   B6
   t+4:   B7
   t+5:   B8
   t+6, t+7:  (no instructions issued)

Show the execution schedule for the two threads on the two units assuming:
(a) Coarse-grain multithreading
(b) Fine-grain multithreading
(c) Simultaneous multithreading (with priority given to thread A)

(Three blank answer tables, each with rows t through t+14 and a column for each of the two units, were provided for the answers to parts a, b and c.)