
Review of previous examinations
TMA4280 Introduction to Supercomputing
NTNU, IMF
April 24, 2017

Examination

The examination usually comprises:
- one problem on linear algebra operations, with calculation of complexity and parallelism,
- one problem on the parallelization of a solver for partial differential equations,
- a series of short questions covering all the topics of the course,
all to be solved in four hours.

Important points

The course covered a broad range of topics, but a few points are important:
- Differentiate between the different types of parallelism (data, functional/task, and fine-/coarse-grained).
- Be able to categorize hardware architectures, e.g. with Flynn's taxonomy, and understand their benefits/drawbacks.
- Understand the memory hierarchy of computers and how it affects the performance of solvers.
- Be able to clearly define and use the distributed-memory and shared-memory models.
- Have a basic knowledge of the different specifications (OpenMP, MPI, BLAS) and libraries (LAPACK, PETSc) discussed.

Questions: Parallel computing

Question (2016): A good strategy for reducing communication overhead is to increase the number of processes. Answer: true or false.
It is crucial to remember Amdahl's Law and, for instance, the practical effect observed in your project when studying strong scaling.

Question (2015): A code with a large parallel efficiency typically has a lot of network traffic.
The same arguments apply.

Question (2015): A code with a large parallel speedup has a large parallel efficiency.
The same arguments apply.
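
As a reminder (standard definitions, not quoted from the exam or the course notes), speed-up and parallel efficiency are related by

    S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p} \le 1,

so, for instance, a speed-up of S_{64} = 32 is large in absolute terms, yet the corresponding efficiency is only E_{64} = 32/64 = 0.5.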

Questions: Computer architecture

Question (2016): The following loops are compiled with an optimizing compiler. Which of them will likely have the highest performance (more FLOPS)? Answer: A, B or none. Briefly explain why.

    A: for (int i = 0; i < N; i++) a[i] = a[i] + b[i];
    B: for (int i = 0; i < N; i++) a[i] = a[i] + c * b[i];

Remember the definition of FLOPS and the capabilities of the processor when it comes to floating-point operations.
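
A minimal timing sketch (my own code, not part of the exam; the array size N and the helper wtime are illustrative) showing how FLOPS would be measured for the two loops: loop A performs N additions, loop B performs N additions and N multiplications, i.e. roughly twice the floating-point work for a similar amount of memory traffic.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Illustrative helper: wall-clock time in seconds (POSIX clock_gettime). */
    static double wtime(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1.0e-9 * ts.tv_nsec;
    }

    int main(void)
    {
        const int N = 10000000;
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        const double c = 2.0;
        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        double t0 = wtime();
        for (int i = 0; i < N; i++) a[i] = a[i] + b[i];     /* loop A: N additions      -> N flops  */
        double t1 = wtime();
        for (int i = 0; i < N; i++) a[i] = a[i] + c * b[i]; /* loop B: N adds + N muls  -> 2N flops */
        double t2 = wtime();

        printf("A: %.2f GFLOPS\n", (double)N / (t1 - t0) / 1.0e9);
        printf("B: %.2f GFLOPS\n", 2.0 * N / (t2 - t1) / 1.0e9);
        free(a); free(b);
        return 0;
    }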

Questions: Computer architecture

Question (2015): A ccNUMA machine can always do multiple additions in parallel. Answer: true or false.
Remember Flynn's taxonomy and the discussion of different computer architectures.

Question (2015): An LFU cache replacement policy is typically the best for solving partial differential equations.
The question was briefly discussed in the description of the different caching policies: First-In First-Out (FIFO), Least-Frequently Used (LFU), Least-Recently Used (LRU) and pseudo-LRU. Even without remembering the details of the implementations, you can think about the memory access pattern.

Questions: Computer architecture

Question (2015): A SIMD processor can perform a multiplication and an addition simultaneously. Answer: true or false.
Remember Flynn's taxonomy to motivate your answer.

Question (2015): Cancellation is a concern when subtracting floating-point numbers. Answer: true or false.
Remember the binary representation of floating-point numbers and how it affects the decimal precision.

Question (2014): Floating-point numbers of a given precision can only represent a fixed range of numbers.
Remember the binary representation of floating-point numbers.
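
A small illustration of cancellation (my own example, not from the exam): 1 - cos(x) for small x subtracts two nearly equal numbers and loses all significant digits, while the algebraically identical form 2 sin^2(x/2) does not.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 1.0e-8;
        double naive  = 1.0 - cos(x);                       /* cancellation: evaluates to 0 here     */
        double stable = 2.0 * sin(0.5 * x) * sin(0.5 * x);  /* ~5e-17, close to the exact value      */
        printf("naive  = %.17g\n", naive);
        printf("stable = %.17g\n", stable);
        return 0;
    }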

Questions: Computer architecture

Question (2014): A modern processor typically has a cache hierarchy. These are designed as levels, where a higher level is given to faster memory.
Remember, for instance, the architecture of modern CPUs and the different memory levels.

Question (2014): A superscalar processor can perform two additions simultaneously.
Remember the discussion about pipelining.

Questions: Distributed memory

Question (2016): It is always OK to call MPI library functions from different threads. Answer: true or false.
Question (2016): It is never OK to call MPI library functions from different threads. Answer: true or false.

MPI defines four levels of thread support:
- MPI_THREAD_SINGLE: only one thread will execute.
- MPI_THREAD_FUNNELED: the process may be multi-threaded, but only the main thread will make MPI calls (all MPI calls are funneled to the main thread).
- MPI_THREAD_SERIALIZED: the process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time.
- MPI_THREAD_MULTIPLE: multiple threads may call MPI, with no restrictions.

In practice only MPI_THREAD_SINGLE is portable; see Balaji (2010) for a discussion.
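
A minimal sketch (my own code, using standard MPI calls) of how a program requests and checks the thread support level:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        /* Request the highest level; the library reports what it actually provides. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            printf("MPI_THREAD_MULTIPLE not available, provided level = %d\n", provided);
        }
        MPI_Finalize();
        return 0;
    }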

Questions: Distributed memory

Question (2011): Consider the MPI call below; the amount of data sent corresponds to 128 floating-point numbers in double precision. Answer: true or false.

    MPI_Send(buffer, 1024, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);

Writing the code for the projects should make this question easy.

Question (2011): The most efficient implementation of the MPI_Allreduce operation on P processors completes in a certain number of communication stages. Which of the four alternatives is correct: (i) 1 stage; (ii) log2(P) stages; (iii) 2 log2(P) stages; (iv) P stages?
Remember your implementation in Project I.
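
A sketch of a recursive-doubling reduction (my own code, assuming the number of processes is a power of two): every rank exchanges a partial sum with a partner in each of the log2(P) stages, which is one way an efficient MPI_Allreduce can be organised.

    #include <mpi.h>

    /* Sum `local` over all ranks; every rank gets the result.
       Assumes the communicator size is a power of two. */
    double allreduce_sum(double local, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        double value = local;
        for (int mask = 1; mask < size; mask <<= 1) {   /* log2(P) stages */
            int partner = rank ^ mask;
            double received;
            MPI_Sendrecv(&value, 1, MPI_DOUBLE, partner, 0,
                         &received, 1, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            value += received;
        }
        return value;
    }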

Questions: Distributed memory

Question (2011): The functions MPI_Send and MPI_Recv are appropriate to use for the exchange of an interprocess boundary. Is it possible to experience "deadlock" when using these functions? Answer: yes or no.
Knowledge of point-to-point communication is sufficient.
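
A sketch of a halo exchange (my own code; u, n, left and right are hypothetical names): pairing each send with a receive via MPI_Sendrecv avoids the ordering problem that can block two neighbours that both issue a blocking MPI_Send first.

    #include <mpi.h>

    /* 1D domain decomposition: u[0] and u[n+1] are ghost cells, u[1..n] is interior.
       `left`/`right` are neighbour ranks, or MPI_PROC_NULL at the domain ends. */
    void exchange_boundaries(double *u, int n, int left, int right, MPI_Comm comm)
    {
        MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 0,      /* send last interior value right  */
                     &u[0], 1, MPI_DOUBLE, left, 0,       /* receive left neighbour's value  */
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 1,       /* send first interior value left  */
                     &u[n + 1], 1, MPI_DOUBLE, right, 1,  /* receive right neighbour's value */
                     comm, MPI_STATUS_IGNORE);
    }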

Questions: Shared-memory model

Question (2016): Since threads do not need to communicate (like processes do), there is no penalty incurred by using more of them. Answer: true or false.
Remember how OpenMP works and the difference between concurrency and parallelism.

Question (2014): OpenMP is usable from all programming languages.
It is enough to know about the OpenMP specification.
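
A minimal OpenMP sketch (my own example; the array size is illustrative): even without explicit messages, each parallel region pays for thread creation or wake-up, the implicit barrier and the reduction, so adding threads is not free.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        const int N = 1000000;
        static double a[1000000];
        double sum = 0.0;

        /* Fork a team of threads; the implicit barrier at the end of the loop
           and the reduction both cost synchronisation time. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = (double)i;
            sum += a[i];
        }

        printf("sum = %g using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }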

Questions: Numerical Linear Algebra

Question (2016): BLAS is a high-performance library for solving linear systems of equations. Answer: true or false.
Remember how the specification was introduced during the lecture. Hint: the different levels of algebraic operations.

Question (2015): In code utilizing dense linear algebra, you have to choose between using BLAS or LAPACK. Answer: true or false.
A similar question; the relation between BLAS and LAPACK is important to remember.
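
A short sketch of the division of labour (my own example, assuming a CBLAS/LAPACKE installation): BLAS supplies the basic algebraic operations, while LAPACK builds solvers such as dgesv on top of BLAS kernels.

    #include <cblas.h>
    #include <lapacke.h>

    int main(void)
    {
        /* BLAS level 1: y <- 2*x + y (daxpy). */
        double x[3] = {1.0, 2.0, 3.0}, y[3] = {0.0, 0.0, 0.0};
        cblas_daxpy(3, 2.0, x, 1, y, 1);

        /* LAPACK driver: solve A b = rhs by LU factorisation (dgesv);
           internally it relies on BLAS kernels. */
        double A[4]   = {4.0, 1.0,
                         1.0, 3.0};   /* 2x2 matrix, row-major */
        double rhs[2] = {1.0, 2.0};
        lapack_int ipiv[2];
        LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, A, 2, ipiv, rhs, 1);
        return 0;
    }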

Questions: Numerical Linear Algebra

Question (2015): Using PETSc, there is no point in pre-declaring the sparsity pattern of your operator. Answer: true or false.
Review how the storage for a sparse matrix is handled and how it may affect the performance.

Question (2014): You typically get performance closer to the theoretical peak performance of a machine when you do level 1 operations (vector operations) compared to level 3 (matrix-matrix operations).
Think about the ratio between operations and data movement.
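
A PETSc fragment (my own sketch; the value 5 assumes something like a 2D five-point stencil) showing why declaring the sparsity pattern matters: preallocation reserves the expected number of nonzeros per row, so assembly with MatSetValues does not have to repeatedly reallocate and copy storage.

    #include <petscmat.h>

    /* Create an n x n AIJ (sparse) matrix with roughly 5 nonzeros per row. */
    PetscErrorCode create_stencil_matrix(MPI_Comm comm, PetscInt n, Mat *A)
    {
        PetscErrorCode ierr;
        ierr = MatCreate(comm, A); CHKERRQ(ierr);
        ierr = MatSetSizes(*A, PETSC_DECIDE, PETSC_DECIDE, n, n); CHKERRQ(ierr);
        ierr = MatSetType(*A, MATAIJ); CHKERRQ(ierr);
        ierr = MatSeqAIJSetPreallocation(*A, 5, NULL); CHKERRQ(ierr);          /* serial case   */
        ierr = MatMPIAIJSetPreallocation(*A, 5, NULL, 2, NULL); CHKERRQ(ierr); /* parallel case */
        return 0;
    }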

Questions: Parallel Input/Output

Question (2015): MPI-I/O is often used to do post-mortem data assembly. Answer: true or false.
The exact term has not been mentioned during the lectures, but remember how the MPI-I/O implementation works.

Question (2014): MPI-I/O always writes multi-dimensional arrays in Fortran order.
Just remember the purpose of MPI-I/O.
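
A sketch of the MPI-I/O idea (my own code; the file name and equal block sizes are assumptions): every rank writes its block of a distributed array at its own offset in a single shared file, so the output is one global file written collectively.

    #include <mpi.h>

    /* Each rank owns `local_n` doubles of a global 1D array (equal block sizes
       assumed) and writes them at its offset in one shared file. */
    void write_global_array(const double *local, int local_n, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_File fh;
        MPI_File_open(comm, "solution.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Offset offset = (MPI_Offset)rank * local_n * (MPI_Offset)sizeof(double);
        MPI_File_write_at_all(fh, offset, local, local_n, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }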

Notions used in Problems

Study of the parallel performance of algorithms.

Amdahl's law: speed-up on p processors with respect to the serial run,

    S_p = T_1 / T_p = p / (f (p - 1) + 1),

where f is the fraction of time spent in serial sections of the code. The fraction f is:
- 1 for the purely serial case,
- 0 for the ideal parallel case.

Linear strong scaling: f = 0 gives S_p = T_1 / T_p = p, the ideal speed-up when solving a fixed-size problem on p processors.
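
A quick worked number under this model (my own example, not from an exam problem): with a serial fraction f = 0.1 on p = 16 processors,

    S_{16} = \frac{16}{0.1\,(16 - 1) + 1} = \frac{16}{2.5} = 6.4,
    \qquad
    \lim_{p \to \infty} S_p = \frac{1}{f} = 10.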

Notions used in Problems

Some notions are important and will be used during the examination.

Linear model for transmission:

    τ_c(k) = τ_s + γ k,                              (1)

with τ_s a constant representing the start-up phase (latency) and γ the inverse bandwidth.

Amdahl's law: speed-up on p processors with respect to the serial run,

    S_p = p / (p T_c / T_1 + 1),

with T_c the time spent in communication, a function of τ_c.

Challenges of parallel computing

[Figure: ideal speed-up (S_P = P) versus realistic speed-up as a function of P.]

- Strong scaling should be interpreted carefully!
- Weak scaling should also be considered.