CS/CoE 1541 Final exam (Fall 2017)
Name:

This is the cumulative final exam given in the Fall of 2017.

Question 1 (12 points): was on Chapter 4
Question 2 (13 points): was on Chapter 4

For Exam 2, you need to review Questions 3, 4 and 5a from this exam, in addition to Questions 6 and 7 from last year's mid-term that was posted earlier. Questions 5b, 6, 7 and 8 of this exam relate to material that we have not covered yet. You should get back to them when you review for Exam III.
Question 3: (10 points)
Consider a computer system that supports a 16MB virtual address space (byte addressable) with 1KB pages, 256KB of physical memory and an 8-entry direct-mapped TLB.

a) Assume that the TLB configuration is as follows:

    Valid    Tag    Page address

For each of the following virtual addresses, specify the virtual page number and find the physical page number. If an address causes a TLB miss, indicate so (do not find the physical address in such a case).

    Virtual address (byte address)    Virtual page number    Physical page number

b) A TLB miss may or may not result in a page fault. When does a TLB miss result in a page fault?

c) What would be the main advantage and the main disadvantage of replacing the direct-mapped TLB with a fully associative TLB?
Advantage:
Disadvantage:
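The address arithmetic behind Question 3 can be sketched in C. With a 16MB virtual address space the virtual address is 24 bits wide, and 1KB pages leave a 10-bit page offset and a 14-bit virtual page number. The sketch assumes the usual convention that a direct-mapped TLB is indexed by the low-order bits of the virtual page number (3 bits for 8 entries) and tagged with the remaining bits; the exam does not state this explicitly, and all function names are hypothetical.

```c
#include <stdint.h>

enum {
    PAGE_OFFSET_BITS = 10, /* 1KB pages: 2^10 bytes per page */
    TLB_INDEX_BITS   = 3   /* 8-entry direct-mapped TLB: 2^3 entries */
};

/* Virtual page number: drop the page offset from the virtual address. */
uint32_t virtual_page_number(uint32_t vaddr)
{
    return vaddr >> PAGE_OFFSET_BITS;
}

/* Byte offset within the page: the low 10 bits of the address. */
uint32_t page_offset(uint32_t vaddr)
{
    return vaddr & ((1u << PAGE_OFFSET_BITS) - 1u);
}

/* Assumed convention: the low bits of the VPN select the TLB entry. */
uint32_t tlb_index(uint32_t vpn)
{
    return vpn & ((1u << TLB_INDEX_BITS) - 1u);
}

/* Assumed convention: the remaining VPN bits are compared against the tag. */
uint32_t tlb_tag(uint32_t vpn)
{
    return vpn >> TLB_INDEX_BITS;
}

/* On a TLB hit, the physical address is the physical page number
   concatenated with the unchanged page offset. */
uint32_t physical_address(uint32_t ppn, uint32_t offset)
{
    return (ppn << PAGE_OFFSET_BITS) | offset;
}
```

For example, the made-up virtual address 0xABCDEF has virtual page number 0x2AF3 and offset 0x1EF; under the assumed convention it selects TLB entry 3 and is compared against tag 0x55E.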
Question 4 (15 points):
Consider a 2GHz processor with separate instruction and data caches. We are focusing on improving the data cache performance, assuming that the instruction cache achieves a 100% hit rate while our data cache achieves only a 90% hit rate. The processor is a pipelined processor in which, on average, an instruction takes 3.5 cycles to execute if the cache hit rate is 100% (that is, CPI = 3.5).

(a) If the memory access latency is 40 ns, what is the data cache miss penalty (in cycles)?

(b) Assuming that the cache miss penalty is x cycles (the number computed in part a), and that 25% of all instructions are memory instructions, write the formula that you would use to compute the CPI (write the formula in terms of x; do not evaluate it).

(c) To improve the overall performance, we decided to introduce a second-level cache (L2). Its access latency is five cycles. The L2 cache hit rate is measured to be 50% (i.e., the probability of finding a block in L2 when it misses in L1). Write the formula that you would use to compute the CPI in this case (keep the formula in terms of x; do not evaluate it).

(d) If the memory address (byte addressable) consists of 32 bits, b31, ..., b0, and the data cache is a 64KB, 4-way associative cache with a 64B block size, specify the bits that should be used to index the cache and those that should be used as a tag.

Bits for index are: b__, ..., b__
Bits for tag are: b__, ..., b__

    b31 ... b0
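The quantities in Question 4 can be checked with a few small C helpers. This is a sketch under stated assumptions, not the official solution: the two-level formula assumes that every L1 miss pays the full L2 access latency and that an L2 miss additionally pays the memory penalty x, and the address split assumes the conventional layout of block offset in the low bits, set index above it, and tag in the remaining high bits. All function and type names are hypothetical.

```c
#include <stdint.h>

/* Miss penalty in cycles: latency in ns times cycles per ns (clock in GHz). */
double miss_penalty_cycles(double latency_ns, double clock_ghz)
{
    return latency_ns * clock_ghz;
}

/* CPI with a single cache level: base CPI plus data-miss stall cycles. */
double cpi_l1_only(double base_cpi, double mem_fraction,
                   double l1_miss_rate, double x)
{
    return base_cpi + mem_fraction * l1_miss_rate * x;
}

/* CPI with an L2 cache. Assumption: each L1 miss costs the L2 latency,
   and the fraction that also misses in L2 costs x more cycles. */
double cpi_with_l2(double base_cpi, double mem_fraction, double l1_miss_rate,
                   double l2_latency, double l2_miss_rate, double x)
{
    return base_cpi
         + mem_fraction * l1_miss_rate * (l2_latency + l2_miss_rate * x);
}

/* log2 of a power of two, computed by shifting. */
unsigned log2_exact(uint32_t v)
{
    unsigned bits = 0;
    while (v > 1u) {
        v >>= 1;
        bits++;
    }
    return bits;
}

struct cache_fields {
    unsigned offset_bits; /* low bits selecting a byte in the block */
    unsigned index_bits;  /* middle bits selecting a set */
    unsigned tag_bits;    /* remaining high bits */
};

/* Splits an address for a set-associative cache (all sizes powers of two). */
struct cache_fields split_address(uint32_t cache_bytes, uint32_t ways,
                                  uint32_t block_bytes, unsigned addr_bits)
{
    struct cache_fields f;
    uint32_t sets = cache_bytes / (block_bytes * ways);
    f.offset_bits = log2_exact(block_bytes);
    f.index_bits  = log2_exact(sets);
    f.tag_bits    = addr_bits - f.index_bits - f.offset_bits;
    return f;
}
```

Calling `miss_penalty_cycles(40.0, 2.0)` gives 80 cycles, and `split_address(64 * 1024, 4, 64, 32)` gives a 6-bit offset, an 8-bit index and an 18-bit tag, which under the assumed layout corresponds to index bits b13 to b6 and tag bits b31 to b14.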
Question 5: (10 points)
(a) Consider a quad-processor system where each processor (P0, P1, P2 and P3) has a private direct-mapped cache and all cores share a single address space (shared memory system). Assume that memory address A maps to some cache block and that this block is initially invalid in all the caches. Specify the state (I=invalid, S=shared or E=exclusive/modified) of the block in the cache of each processor after each operation in the following sequence of memory operations:

                  In P0    In P1    In P2    In P3
    Initially       I        I        I        I
    P1 reads A
    P2 reads A
    P3 writes A
    P0 writes A
    P1 reads A
    P0 writes A

(b) Let the distance between two nodes be defined as the number of links between the nodes. What are the diameter and the bisection width of each of the networks shown below?

    Diameter =                Diameter =
    Bisection width =         Bisection width =
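The state changes asked about in part (a) can be traced with a small C model. It is a sketch that assumes a simple three-state write-invalidate snooping protocol: a read miss always installs the block in state S, a cache holding the block in state E supplies the data and drops to S when another processor reads, and a write invalidates every other copy and leaves the writer in E. The course's exact protocol may differ in detail, and the names below are hypothetical.

```c
enum { NUM_CPUS = 4 };

typedef enum { STATE_I, STATE_S, STATE_E } block_state;

/* Processor `cpu` reads the block. */
void cpu_read(block_state st[NUM_CPUS], int cpu)
{
    if (st[cpu] != STATE_I)
        return; /* read hit in S or E: no state change */
    for (int p = 0; p < NUM_CPUS; p++)
        if (st[p] == STATE_E)
            st[p] = STATE_S; /* assumed: owner writes back, keeps a shared copy */
    st[cpu] = STATE_S;
}

/* Processor `cpu` writes the block. */
void cpu_write(block_state st[NUM_CPUS], int cpu)
{
    for (int p = 0; p < NUM_CPUS; p++)
        if (p != cpu)
            st[p] = STATE_I; /* invalidate all other copies */
    st[cpu] = STATE_E;
}
```

Starting from all-invalid and replaying the sequence from the question (P1 reads, P2 reads, P3 writes, P0 writes, P1 reads, P0 writes) under these assumptions gives the rows I S I I, I S S I, I I I E, E I I I, S S I I and E I I I.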
Question 6: (10 points)
(a) Consider the following skeleton for a CUDA program:

    __global__ void my_kernel (int *a, int *b)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x ;   /* note that blockDim.x = 2 */
        a[idx+1] = threadIdx.x ;
        b[idx] = blockIdx.x ;
    }

    void main()
    {
        my_kernel<<< 4,2 >>> (a, b);   /* four blocks, each containing 2 threads */
    }

Assuming that arrays a[] and b[] are allocated in the global memory and are initialized to -1, what will be the values of their elements after my_kernel finishes execution?

    a[0]  a[1]  a[2]  a[3]  a[4]  a[5]  a[6]  a[7]  a[8]

    b[0]  b[1]  b[2]  b[3]  b[4]  b[5]  b[6]  b[7]  b[8]

(b) For each of the following statements, circle the correct answer:

- The __syncthreads() CUDA barrier synchronizes all the threads of a:
  (i) kernel  (ii) thread block  (iii) warp
- In a CUDA kernel, it is more efficient to use a block size of:
  (i) 64 threads  (ii) 16 threads  (iii) 8 threads
- In a GPU, the shared memory is shared among all the threads of a:
  (i) kernel  (ii) thread block  (iii) warp
- cudaMalloc() is used to dynamically allocate space in:
  (i) shared memory  (ii) global memory  (iii) CPU memory
- All the threads in a thread block execute on the same:
  (i) SM - streaming multiprocessor  (ii) SP - streaming processor  (iii) device
- The number of registers in an SM determines the maximum number of ____ allowed on the SM:
  (i) warps  (ii) thread blocks  (iii) threads
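The index arithmetic of `my_kernel` can be explored without a GPU by emulating the launch in plain C. The sketch below is a sequential host-side emulation, not CUDA code: two nested loops stand in for `blockIdx.x` and `threadIdx.x`, which is safe here only because every thread writes distinct array elements, so execution order does not affect the result. The function name and the 9-element array length (taken from the answer table) are assumptions.

```c
enum { NUM_BLOCKS = 4, THREADS_PER_BLOCK = 2, ARRAY_LEN = 9 };

/* Emulates my_kernel<<<4,2>>>(a, b) on the host, one "thread" at a time. */
void emulate_my_kernel(int a[ARRAY_LEN], int b[ARRAY_LEN])
{
    /* Both arrays start out filled with -1, as stated in the question. */
    for (int i = 0; i < ARRAY_LEN; i++) {
        a[i] = -1;
        b[i] = -1;
    }

    for (int block = 0; block < NUM_BLOCKS; block++) {
        for (int thread = 0; thread < THREADS_PER_BLOCK; thread++) {
            /* Same formula as in the kernel, with blockDim.x = 2. */
            int idx = block * THREADS_PER_BLOCK + thread;
            a[idx + 1] = thread; /* a[1..8] are written, a[0] is untouched */
            b[idx] = block;      /* b[0..7] are written, b[8] is untouched */
        }
    }
}
```

After the call, `a` holds -1, 0, 1, 0, 1, 0, 1, 0, 1 and `b` holds 0, 0, 1, 1, 2, 2, 3, 3, -1.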
Question 7: (20 points)
Consider the following Pthread program, which computes the sum of N numbers using P threads. The sum will be computed in sum[0]:

    #define P 8       /* P is a power of 2 */
    #define N 1024    /* N is a power of 2 */

    void *compute_sum ( void *);
    struct arg_to_thread { int id ; };
    float A[N];       /* assume that A[] is initialized to some values */
    float sum[P] ;

    main (int argc, char *argv[] )
    {
        int i ;
        pthread_t p_threads[P];
        pthread_attr_t attr;
        struct arg_to_thread my_arg[P] ;

        pthread_attr_init (&attr);
        for (i=0; i < P ; i++ ) {
            my_arg[i].id = i ;
            pthread_create (&p_threads[i], &attr, compute_sum, (void*) &my_arg[i]);
        }
        for (i=0; i < P ; i++)
            pthread_join (p_threads[i], NULL);
    }

    void *compute_sum (void *arg)
    {
        struct arg_to_thread *local_arg ;
        int i, half, idx ;

        local_arg = arg;
        idx = (*local_arg).id;    /* idx is the id given to the thread */

        for (i = ____ ; i < ____ ; i++)          /* compute a partial sum */   /*line 1*/
            sum[____] = sum[____] + A[i] ;                                     /*line 2*/
        half = P/2 ;                                                           /*line 3*/
        for (i = ____ ; ____ ; i++) {            /* compute the global sum */  /*line 4*/
            if ( ____ ) {                                                      /*line 5*/
                sum[____] = sum[____] + sum[____] ;                            /*line 6*/
            }
            half = ____ ;                                                      /*line 7*/
        }                                                                      /*line 8*/
    }
(a) Ignoring the need for synchronization, complete the lines labeled /*line 1*/ to /*line 7*/ in the function compute_sum such that the sum of the 1024 numbers is computed in sum[0]. The sum is computed by forking 8 threads, each of which computes the sum of 128 numbers (in parallel). The 8 partial sums are then added together using a tree reduction algorithm.

(b) Indicate after which line(s) you should add barrier synchronizations and explain why these barriers are needed.

(c) What are the speedup and the efficiency obtained from the parallel execution when N = 1024 and P = 8?

    Serial execution time = ____ steps
    Execution time using 8 processors = ____ steps
    Speedup = ____
    Efficiency = ____

(d) Assuming that you can use as many processors as you want, what is the maximum speedup that can be obtained to solve the problem for N = 1024?

(e) Amdahl's law indicates that if f is the fraction of the task that has to execute serially, then the maximum speedup that can be obtained is 1/f. Why can't we apply Amdahl's law to obtain the answer to part (d)?
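The tree reduction described in part (a) can be illustrated with a sequential C emulation in which ordinary loops play the role of the 8 threads. This is one possible way to organize the computation, not necessarily the intended completion of the blanks: each emulated thread sums its own contiguous chunk of 128 elements, and then in each tree step the lower half of the active threads adds in the partial sum held by its partner `half` positions away. In the real Pthread version every tree step would have to wait for the previous one, which the sequential loop order provides for free here. The names `NUM_THREADS`, `NUM_ELEMS`, `tree_reduce_sum`, `speedup` and `efficiency` are hypothetical.

```c
enum { NUM_THREADS = 8, NUM_ELEMS = 1024 };

/* Emulates the two phases one thread id at a time and reports how many
   tree steps (levels of the reduction tree) were needed. */
float tree_reduce_sum(const float A[NUM_ELEMS], int *tree_steps)
{
    float sum[NUM_THREADS];
    int chunk = NUM_ELEMS / NUM_THREADS; /* 128 elements per thread */

    /* Phase 1: every thread idx computes the partial sum of its chunk. */
    for (int idx = 0; idx < NUM_THREADS; idx++) {
        sum[idx] = 0.0f;
        for (int i = idx * chunk; i < (idx + 1) * chunk; i++)
            sum[idx] = sum[idx] + A[i];
    }

    /* Phase 2: tree reduction; the number of active threads halves
       at every step until only sum[0] remains. */
    int steps = 0;
    for (int half = NUM_THREADS / 2; half >= 1; half = half / 2) {
        for (int idx = 0; idx < half; idx++)
            sum[idx] = sum[idx] + sum[idx + half];
        steps++;
    }

    *tree_steps = steps;
    return sum[0];
}

/* Speedup is serial time divided by parallel time. */
double speedup(double serial_steps, double parallel_steps)
{
    return serial_steps / parallel_steps;
}

/* Efficiency is speedup divided by the number of processors. */
double efficiency(double achieved_speedup, int processors)
{
    return achieved_speedup / processors;
}
```

With P = 8 the reduction tree has 3 levels. How many steps to charge to each phase for part (c) depends on the counting convention used in the course; for instance, counting one step per addition and no step for loading the first element would give 1023 serial steps against 127 + 3 parallel steps.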
Question 8 (10 points):
Consider a superscalar architecture with two pipelined units, one for load/store instructions and one for other instructions. The following two tables indicate the order of execution of two threads, A and B, and the latencies mandated by dependences between instructions. For example, A2 and A3 should execute after A1, there should be at least four cycles between the execution of A3 and A5, and at least three cycles between A4 and A5. In other words, the tables indicate the schedule if the instructions of each of the threads are executed with no multithreading. Note that an instruction that executes on one pipeline (for example, A1 on the load/store pipeline) cannot execute on the other.

Thread A (time: instructions issued, load/store column first):
    t:    A1
    t+1:  A3  A2
    t+2:  A4
    t+3:
    t+4:
    t+5:
    t+6:  A5
    t+7:  A6  A7

Thread B (time: instructions issued, load/store column first):
    t:    B1  B2
    t+1:  B3
    t+2:  B4  B5
    t+3:  B6
    t+4:  B7
    t+5:  B8
    t+6:
    t+7:

Show the execution schedule for the two threads on the two pipelines assuming:
(a) Coarse-grain multithreading
(b) Fine-grain multithreading
(c) Simultaneous multithreading (with priority given to thread A)

For each of (a), (b) and (c), fill in a table with one row per time step from t to t+14 and one column per pipeline.
More informationHPCSE - I. «Introduction to multithreading» Panos Hadjidoukas
HPCSE - I «Introduction to multithreading» Panos Hadjidoukas 1 Processes and Threads POSIX Threads API Outline Thread management Synchronization with mutexes Deadlock and thread safety 2 Terminology -
More informationCS 179: GPU Programming. Lecture 7
CS 179: GPU Programming Lecture 7 Week 3 Goals: More involved GPU-accelerable algorithms Relevant hardware quirks CUDA libraries Outline GPU-accelerated: Reduction Prefix sum Stream compaction Sorting(quicksort)
More informationIntroduction to Parallel Computing with CUDA. Oswald Haan
Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries
More informationPractical Introduction to CUDA and GPU
Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing
More informationLecture 5. Performance Programming with CUDA
Lecture 5 Performance Programming with CUDA Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Today s lecture Matrix multiplication 2011 Scott B. Baden / CSE 262 / Spring 2011 3 Memory Hierarchy
More informationCMPSC 311- Introduction to Systems Programming Module: Concurrency
CMPSC 311- Introduction to Systems Programming Module: Concurrency Professor Patrick McDaniel Fall 2013 Sequential Programming Processing a network connection as it arrives and fulfilling the exchange
More informationScientific discovery, analysis and prediction made possible through high performance computing.
Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013
More informationGlobal Memory Access Pattern and Control Flow
Optimization Strategies Global Memory Access Pattern and Control Flow Objectives Optimization Strategies Global Memory Access Pattern (Coalescing) Control Flow (Divergent branch) Global l Memory Access
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationA Programmer s View of Shared and Distributed Memory Architectures
A Programmer s View of Shared and Distributed Memory Architectures Overview Shared-memory Architecture: chip has some number of cores (e.g., Intel Skylake has up to 18 cores depending on the model) with
More information