Cilk programs as a DAG
- Dwayne Gilmore
1 Cilk programs as a DAG
- The pattern of spawn and sync commands defines a graph
- The graph contains dependencies between different functions
- spawn command creates a new task with an out-bound link
- sync command creates inbound link from spawned tasks
cilk int Fib(n=3) { if(n<2) return n; int x=spawn Fib(n-1); int y=spawn Fib(n-2); sync; return x+y; }
cilk int Fib(n=2) { if(n<2) return n; int x=spawn Fib(n-1); int y=spawn Fib(n-2); sync; return x+y; }
cilk int Fib(n=1) { if(n<2) return n; int x=spawn Fib(n-1); int y=spawn Fib(n-2); sync; return x+y; }
cilk int Fib(n=1) { if(n<2) return n;... }
cilk int Fib(n=0) { if(n<2) return n;... }
HPCE / dt10 / 2014 / 16.1
2
cilk int Fib(int n) {
    if(n<2) return n;
    int x=spawn Fib(n-1);
    int y=spawn Fib(n-2);
    sync;
    return x+y;
}
3 Steps within a function execute sequentially. Independent functions may execute in parallel.
4
- Total Work : $T_1$ - total time required to execute all tasks
- Critical path : $T_\infty$ - longest path through all tasks
- Assume each step takes unit time : total work = 35; critical path = 16
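The two quantities above can be computed mechanically for any task graph. The following is a minimal sketch, not the course's code: the function name `work_and_span` and the four-task graph are hypothetical, and the graph is not the one from the slide's figure, so it does not reproduce the 35 and 16 quoted above.

```python
from functools import lru_cache


def work_and_span(cost, preds):
    """Return (T_1, T_inf) for a task DAG.

    cost  : dict mapping task name -> execution time of that task
    preds : dict mapping task name -> list of tasks it depends on
    """
    @lru_cache(maxsize=None)
    def finish(task):
        # Earliest finish time with unlimited processors:
        # wait for the slowest predecessor, then run the task itself.
        start = max((finish(p) for p in preds.get(task, [])), default=0)
        return start + cost[task]

    total_work = sum(cost.values())                # T_1: every task runs once
    critical_path = max(finish(t) for t in cost)   # T_inf: longest path
    return total_work, critical_path


# Hypothetical graph: "a" spawns "b" and "c"; "d" is the step after the sync.
cost = {"a": 1, "b": 3, "c": 2, "d": 1}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
t1, tinf = work_and_span(cost, preds)
print(t1, tinf)
```

Total work simply sums the costs, while the critical path only follows the slower of the two spawned branches.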
5 Best case and worst-case times
- Define three times: $T_1$, $T_P$, $T_\infty$
  - $T_1$ : Time to execute on one processor (Total Work)
  - $T_P$ : Time to execute on P processors
  - $T_\infty$ : Time to execute on infinite processors (Critical Path)
  - $T_1 / T_P$ : Speedup with P processors
- Can establish an ordering on the times
  - $T_1 / P \le T_P$ : Maximum speedup with P processors is P
  - $T_P \ge T_\infty$ : Finite processors are no faster than infinite
- Can talk about scalability
  - If $T_1 / T_P = \Theta(P)$ then linear speedup (perfect scaling)
- We always want linear speedup: can we achieve it?
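As a worked illustration of the ordering, the two lower bounds on $T_P$ can be combined into a single ceiling on speedup. This is a sketch with hypothetical helper names; the inputs 35 (total work) and 16 (critical path) are the unit-time figures quoted on the earlier slide.

```python
def lower_bound_tp(t1, tinf, p):
    """T_P can be no smaller than T_1 / P, and no smaller than T_inf."""
    return max(t1 / p, tinf)


def max_speedup(t1, tinf, p):
    """Upper limit on T_1 / T_P implied by the two lower bounds on T_P."""
    return t1 / lower_bound_tp(t1, tinf, p)


for p in (1, 2, 4, 8):
    print(p, lower_bound_tp(35, 16, p), max_speedup(35, 16, p))
```

Once $P$ exceeds $T_1 / T_\infty$ (about 2.2 for these figures), the critical path dominates and extra processors cannot raise the bound any further.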
6 Greedy Schedulers
- A Greedy Scheduler executes work using an ASAP approach
  - Each time step, launch all tasks with no outstanding dependencies
  - The notion of a time-step is deliberately context dependent
- When executing with P processors we have two types of step
  - complete step : There are P or more tasks ready to execute
  - incomplete step : There are fewer than P tasks ready to execute
- A greedy scheduler always achieves $T_P \le T_1 / P + T_\infty$
  - Best case is easy to visualise: we do all work in $T_P$ complete steps
  - Worst case is a bit more difficult
    - Steps on critical path execute in incomplete steps
    - Last step on critical path frees up all remaining work for complete steps
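The greedy bound can be checked by simulation. The sketch below assumes unit-time tasks and a hypothetical nine-task graph; `greedy_steps` and `span` are illustrative names rather than part of any Cilk runtime, and ties between ready tasks are broken alphabetically purely to make the run deterministic.

```python
def span(preds):
    """Critical path length (T_inf), counted in unit-time tasks."""
    depth = {}

    def visit(task):
        if task not in depth:
            depth[task] = 1 + max((visit(q) for q in preds[task]), default=0)
        return depth[task]

    return max(visit(t) for t in preds)


def greedy_steps(preds, p):
    """Time steps used when each step runs up to p ready tasks."""
    done = set()
    steps = 0
    while len(done) < len(preds):
        # A task is ready once every task it depends on has completed.
        ready = sorted(t for t in preds
                       if t not in done and all(q in done for q in preds[t]))
        done.update(ready[:p])   # complete step if len(ready) >= p
        steps += 1
    return steps


# Hypothetical graph: "s" spawns a chain a1 -> a2 -> a3 plus four
# independent tasks b1..b4; "j" is the step after the sync.
preds = {
    "s": [],
    "a1": ["s"], "a2": ["a1"], "a3": ["a2"],
    "b1": ["s"], "b2": ["s"], "b3": ["s"], "b4": ["s"],
    "j": ["a3", "b1", "b2", "b3", "b4"],
}
t1, tinf = len(preds), span(preds)
for p in (1, 2, 3, 4):
    print(p, greedy_steps(preds, p), t1 / p + tinf)
```

For every processor count the simulated step count stays between the lower bound max($T_1/P$, $T_\infty$) and the greedy upper bound $T_1/P + T_\infty$.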
8 Linear Scaling and Greedy Schedulers
- Previous equations assume zero-cost scheduling
  - Some overhead involved in tracking tasks that can be run
  - Some overhead in scheduling ready tasks to a processor
- Define critical overhead : $c_\infty$
  - Smallest $c_\infty$ such that $T_P \le T_1 / P + c_\infty T_\infty$
  - Covers the cost of tracking dependencies on critical path
- Linear scaling if there is usually much more work than CPUs
  - Average parallelism : $\bar{P} = T_1 / T_\infty$
  - Assumption of parallel slackness : $\bar{P} / P \gg c_\infty$
  - Therefore: $T_1 / P \gg c_\infty T_\infty$
  - And so: $T_P \approx T_1 / P$ (linear speedup)
- Assumption of parallel slackness implies linear speedup
9 Is that a reasonable assumption?
- Central idea is that most steps are complete
  - All processors are occupied most of the time
- Does computation look like that?
  - Recall Gustafson's law and the finite-difference example
  - $T_1 = O(n^2)$; $T_\infty = O(n)$
  - $\bar{P} = T_1 / T_\infty = O(n)$
  - Assuming $c_\infty$ is not too high we should get linear scaling
- For lots of stuff the assumption is broadly true
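To put numbers on the finite-difference case, the sketch below evaluates the bound with critical overhead for one problem size. It assumes all hidden constant factors equal 1, so total work is exactly n squared and the critical path exactly n; the function name and the values chosen for n, P, and the critical overhead are hypothetical.

```python
def slackness_report(n, p, c_inf):
    """Evaluate T_P <= T_1/P + c_inf * T_inf for T_1 = n^2, T_inf = n."""
    t1 = n * n      # total work (constant factor assumed to be 1)
    tinf = n        # critical path (constant factor assumed to be 1)
    avg_parallelism = t1 / tinf
    work_term = t1 / p
    path_term = c_inf * tinf
    return {
        "avg_parallelism": avg_parallelism,
        "slackness": avg_parallelism / p,    # want this >> c_inf
        "bound": work_term + path_term,
        "path_share": path_term / (work_term + path_term),
    }


report = slackness_report(n=1000, p=8, c_inf=10)
print(report)
```

With these values the slackness is 125 against a critical overhead of 10, and the critical-path term contributes under a tenth of the bound, which is the sense in which the execution time is close to total work divided by P.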
10 Work-first rule
- Define work overhead : $c_1 = T_1 / T_S$
  - $T_S$ : Time to run serial version of program (serial elision)
  - Cost of dynamic scheduling vs static scheduling on one CPU
- What is the importance of $c_1$ vs $c_\infty$?
  - Substitute into previous defn ($T_P \le T_1 / P + c_\infty T_\infty$)
    - $T_P \le c_1 T_S / P + c_\infty T_\infty$
  - Now re-introduce assumption of parallel slackness ($\bar{P} / P \gg c_\infty$)
    - $T_1 / (T_\infty P) \gg c_\infty$
    - $T_1 / P \gg c_\infty T_\infty$
    - $c_1 T_S / P \gg c_\infty T_\infty$
  - Therefore: $T_P \approx c_1 T_S / P$
- Work-first rule: minimise $c_1$ rather than $c_\infty$
11
- Total Work : $T_1$ - total time required for Cilk on one processor (red+green)
- Serial Work : $T_S$ - total time required for serial elision (green only)
- Assume each step takes unit time : total work = 35; serial work = 22
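The unit-time figures above give a concrete work overhead, and the work-first bound shows which overhead is worth attacking. This is a sketch only: the helper names are hypothetical, and the larger workload used in the comparison (serial work, critical path, processor count, and overhead values) is invented for illustration rather than measured.

```python
def work_overhead(t1, ts):
    """c_1 = T_1 / T_S."""
    return t1 / ts


def work_first_bound(ts, tinf, p, c1, c_inf):
    """T_P <= c_1 * T_S / P + c_inf * T_inf."""
    return c1 * ts / p + c_inf * tinf


# Work overhead for the slide's figures: total work 35, serial work 22.
print(work_overhead(35, 22))

# Hypothetical workload with plenty of parallel slackness.
ts, tinf, p = 1e6, 1e3, 8
baseline = work_first_bound(ts, tinf, p, c1=1.6, c_inf=10)
halve_c_inf = work_first_bound(ts, tinf, p, c1=1.6, c_inf=5)
trim_c1 = work_first_bound(ts, tinf, p, c1=1.2, c_inf=10)
print(baseline, halve_c_inf, trim_c1)
```

Under slackness, halving the critical overhead barely moves the bound, whereas a modest reduction in work overhead moves it substantially, which is the content of the work-first rule.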
12 Interpreting the work-first rule
- The work-first rule appears in many guises
  - What are $c_1$ and $c_\infty$ in practice?
- Multi-core CPUs and OSs support traditional threads
  - $c_1$ : How much time to swap between two threads on a CPU?
  - $c_\infty$ : How much time to create a new thread?
- GPUs support hundreds of parallel threads
  - $c_1$ : Nano-second scheduling of threads in a kernel
  - $c_\infty$ : Milli-second cost to manage kernels from the CPU
- Intel TBB supports thousands of tasks
  - $c_1$ : Agglomeration of loop iterations to reduce overheads
  - $c_\infty$ : Hierarchical task based scheduler (based on Cilk)
- Bear this principle in mind when looking at real systems
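Agglomeration of loop iterations can be seen as a direct way of lowering $c_1$. The sketch below uses a deliberately simple cost model that is an assumption, not a description of how TBB works: each task pays a fixed `spawn_cost` (a hypothetical parameter) and each iteration a fixed `iter_cost`, with `grain` iterations bundled per task.

```python
def chunked_loop_cost(n_iters, iter_cost, spawn_cost, grain):
    """Return (T_1, c_1) for a loop split into tasks of `grain` iterations."""
    n_tasks = -(-n_iters // grain)           # ceiling division
    serial_work = n_iters * iter_cost        # T_S: the loop with no tasks
    total_work = serial_work + n_tasks * spawn_cost   # T_1
    return total_work, total_work / serial_work


for grain in (1, 10, 100):
    print(grain, chunked_loop_cost(1000, 1.0, 10.0, grain))
```

With one iteration per task the modelled work overhead is 11; bundling 100 iterations per task brings it down to roughly 1.1, at the price of fewer tasks and therefore less available parallelism.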
13 Work-first has permeated everything
- Vectorisation: size of vector versus cost of operation
- Pipe processing: size of buffer versus cost of call
- FFT: size of parallel loop versus cost of spawning task
- Heat: cost of memory access versus bit-wise accesses (bit more tenuous, but still the same principle)
- Open/Close: size of parallel batch versus latency cost
- Does the assumption of average parallelism hold?
- Bitecoin: ?
14 Administrivia: CW6
- A number of requests for coursework extensions
  - I reluctantly agree to the possibility
  - But only within the context of the exercise
  - Some people already have sunk cost based on original timing
- Proposed amendment
  - Friday 21st, 23:59. Coin weight 1. (Same)
  - Sunday 23rd, 23:59. Coin weight 2. (Was Saturday)
  - Friday 24th, 23:59. Coin weight 3. (Was Monday)
15 Coursework 5 debrief
- Large diversity of solutions, trading off various concepts
  - Currently looking at them as they get ready to compile
  - Also doing final tests on CW
- % TBB; 27% OpenCL+TBB; 20% OpenCL; other
- Lots of approaches to solving the problem
  - Original loop order in chunks
  - Sliding diamonds
  - Clever techniques for handling border overlap
- I'm reluctant to give my solution: implies it is the correct one
  - My plan is to collect together the most interesting ones
  - Write up (with permission) and possibly do a short debrief
16 Reflections on the course: good
- Quality of practical skills learnt is vastly better
  - Some of the CW5 implementations are very sophisticated
- Assessment is more authentic
  - Previous assessments too constrained and artificial
- IO has been considered rather than just ignored
  - All data comes from somewhere and goes somewhere
- The need to test has (mostly) been integrated
  - Previous two years people did not test their code, and it showed
17 Reflections on the course: less good
- Feedback: still way too slow for the first four courseworks
  - Not as slow as last year
  - I need to get out of the way and let GTAs help
  - You will still get all the feedback for all the assessments
- Not using the technology available
  - I set up a message-board, then didn't realise it was invisible
  - Lack of a clear interaction point: webpage, git, blackboard
- (Minor) No project management: git, collaborative work
  - Had to strip out when it became clear there wasn't time
18 Ideas for next year
- Front-load the course more
  - Schedule two lectures + one practical in the first half of term
- Use the technology better
  - I was acting as a conduit: not scalable, and not helpful
  - Collaborative tools exist and work well
- Improve feedback timing
  - Now have more experience of high throughput marking
  - Can now build more robust marking system for early coursework
19 And that's it (Apart from Orals)
More informationAlgorithms and Data Structures
Algorithms and Data Structures or, Classical Algorithms of the 50s, 60s, 70s Richard Mayr Slides adapted from Mary Cryan (2015/16) with small changes. School of Informatics University of Edinburgh ADS
More informationProgramming Parallel Computers
ICS-E4020 Programming Parallel Computers Jukka Suomela Jaakko Lehtinen Samuli Laine Aalto University Spring 2016 users.ics.aalto.fi/suomela/ppc-2016/ New code must be parallel! otherwise a computer from
More informationGPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationEngineering Robust Server Software
Engineering Robust Server Software Scalability Intro To Scalability What does scalability mean? 2 Intro To Scalability 100 Design 1 Design 2 Latency (usec) 75 50 25 0 1 2 4 8 16 What does scalability mean?
More informationIntroduction to Parallel Programming For Real-Time Graphics (CPU + GPU)
Introduction to Parallel Programming For Real-Time Graphics (CPU + GPU) Aaron Lefohn, Intel / University of Washington Mike Houston, AMD / Stanford 1 What s In This Talk? Overview of parallel programming
More informationCOT 4600 Operating Systems Fall Dan C. Marinescu Office: HEC 439 B Office hours: Tu-Th 3:00-4:00 PM
COT 4600 Operating Systems Fall 2009 Dan C. Marinescu Office: HEC 439 B Office hours: Tu-Th 3:00-4:00 PM Lecture 23 Attention: project phase 4 due Tuesday November 24 Final exam Thursday December 10 4-6:50
More informationAccelerating Applications. the art of maximum performance computing James Spooner Maxeler VP of Acceleration
Accelerating Applications the art of maximum performance computing James Spooner Maxeler VP of Acceleration Introduction The Process The Tools Case Studies Summary What do we mean by acceleration? How
More information