Design of a Virtualization Framework to Enable GPU Sharing in Cluster Environments


1 Design of a Virtualization Framework to Enable GPU Sharing in Cluster Environments Michela Becchi University of Missouri nps.missouri.edu

2 GPUs in Clusters & Clouds
Many-core GPUs are used in supercomputers
- 3 out of the top 10 supercomputers use GPUs
- Titan: > 20 petaflops, > 700 terabytes memory, 18,688 nodes, each with a 16-core AMD CPU and 1 Nvidia Tesla K20 GPU
Many-core GPUs are used in cloud computing 2

3 Different usage paradigms
Accelerator model:
- 1 application; GPU: dedicated resource
- Explicit procurement of GPUs
- Static (or programmer-defined) binding of application to GPUs
- Intra-application scheduling
- Memory management within application
Cluster/cloud model:
- Multi-tenancy; GPU: shared resource
- Resource virtualization & transparency
- Dynamic (or runtime) binding of applications to GPUs -> better resource utilization and load balancing
- Intra- and inter-application scheduling
- Advanced memory management across applications required 3

4 Context [figure: cluster nodes running GPU applications such as AMBER, GROMACS, NAMD, GPUBlast, LAMMPS, Blast]

5 We have designed a runtime that
- Abstracts GPUs from end-users
- Schedules applications on GPUs
- Dynamically binds applications to GPUs
- Allows GPU sharing
- Provides memory management
- Provides dynamic recovery and load balancing in case of GPU failure/upgrade/downgrade 5

6 Deployment scenarios
With cluster-level schedulers (e.g.: TORQUE, SLURM)
- [diagram: a cluster-level scheduler dispatches CUDA apps to nodes; on each node the apps call an intercept library, which talks to our runtime sitting on the OS and the CUDA GPU driver/runtime, managing GPUs 1..n]
With VM-based systems for cloud computing (e.g.: Eucalyptus)
- [diagram: CUDA apps in VMs 1..n call an intercept library in each guest OS; the VM manager and our runtime run in the host OS on top of the CUDA GPU driver/runtime, managing GPUs 1..n] 6

7 GPU sharing
Inter-kernel sharing [HPDC 11]
- When: GPU underutilized within a kernel
- Why: limited parallelism, small datasets
- How: kernel consolidation across applications
Inter-application sharing [HPDC 12]
- When: GPU underutilized within an application
- Why: long CPU phases
- How: application multiplexing on GPU
[timelines: k1 and k2 consolidated on one GPU; app1's CPU phases overlapped with app2's GPU phases] 7

8 GPU sharing
Multi-process application sharing [HPDC 13]
- When: GPU underutilized by multi-process applications (e.g. MPI)
- Why: synchronization leads to intra- & inter-application imbalance
- How: preempt some inactive processes to allow other processes to progress
[timelines: without preemption, processes A0/A1 and B0/B1 serialize on GPU 0 and GPU 1; with preemption, they interleave and finish earlier] 8

9 Inter-kernel sharing [HPDC 11]
Each application's GPU phase: m (malloc), cHD (copy host-to-device), k (kernel), cDH (copy device-to-host), f (free)
- Serialized execution: app1's m, cHD, k1, cDH, f followed by app2's m, cHD, k2, cDH, f
- Inter-kernel sharing: both apps' m and cHD phases, then the combined k1 & k2 kernel, then both cDH and f phases
app1 and app2 have no conflicting memory requirements 9
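The benefit of kernel consolidation can be sketched with a small timing model. This is a hedged illustration, not the paper's scheduler: phase names follow the slide (m, cHD, k, cDH, f), but the durations and the assumption that a consolidated kernel runs for roughly the longer of the two kernels are invented for the example.

```python
# Hypothetical makespan model for two apps under serialized execution
# vs. inter-kernel sharing (kernel consolidation); durations are made up.

def serialized(app1, app2):
    """Apps run back to back: total time is the sum of all phases."""
    return sum(app1.values()) + sum(app2.values())

def consolidated(app1, app2):
    """Setup/teardown phases still run for both apps, but the two kernels
    are consolidated into one combined launch that runs for roughly the
    duration of the longer kernel (assuming no resource conflicts)."""
    setup = lambda a: a["m"] + a["cHD"]
    teardown = lambda a: a["cDH"] + a["f"]
    return (setup(app1) + setup(app2)
            + max(app1["k"], app2["k"])
            + teardown(app1) + teardown(app2))

app1 = {"m": 1, "cHD": 2, "k": 10, "cDH": 2, "f": 1}
app2 = {"m": 1, "cHD": 2, "k": 8, "cDH": 2, "f": 1}

print(serialized(app1, app2))    # 30
print(consolidated(app1, app2))  # 22
```

Under this toy model, consolidation saves exactly the shorter kernel's duration, which is why it pays off when a single kernel underutilizes the GPU.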

10 Space- vs. time-sharing: some results
[chart: relative throughput benefit per GPU for four batches of workload mixes (BS+KM, BO+KNN, PDE+MD, EU+IP under space-sharing; BS+BO, KM+KNN, BO+EU, BS+MD under time-sharing), with benefits up to ~1.3] 10

11 Idea: Molding
Downgrade the execution configuration of kernels so as to force beneficial sharing
- Penalize a single application to improve overall throughput
- Limiting # blocks -> forces space sharing
- Limiting # threads/block -> forces time sharing w/ interleaved execution
[diagram: kernel 1 downgraded from 4 to 3 blocks and kernel 2 from 4 to 2 blocks, so kernel 1 and kernel 2 can space-share the GPU after molding] 11
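The block-limiting side of molding can be sketched as a small decision procedure. This is an assumed simplification of the slide's idea (the function name, the shrink-the-larger-kernel-first heuristic, and the SM count are all illustrative): downgrade the two kernels' block counts until they fit side by side on the GPU's SMs.

```python
# Hypothetical sketch of the molding decision: limit the number of blocks
# of each kernel so that two kernels can space-share the SMs of one GPU.

def mold_for_space_sharing(blocks1, blocks2, num_sms):
    """Return downgraded block counts (b1, b2) with b1 + b2 <= num_sms,
    shrinking the larger request first; each kernel keeps at least 1 block."""
    b1, b2 = blocks1, blocks2
    while b1 + b2 > num_sms:
        if b2 >= b1 and b2 > 1:
            b2 -= 1
        elif b1 > 1:
            b1 -= 1
        else:
            break
    return b1, b2

# As in the slide's example: kernel 1 and kernel 2 each ask for 4 blocks,
# and molding leaves kernel 1 with 3 blocks and kernel 2 with 2.
print(mold_for_space_sharing(4, 4, 5))  # (3, 2)
```

A real implementation would also weigh the per-kernel slowdown against the throughput gain, which this sketch ignores.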

12 Molding: some results
[chart: relative throughput benefit with no molding vs. molding, under forced time-sharing and forced space-sharing, for workload mixes IP+BS, PDE+MD, IP+BO, BS+KM; benefits up to ~1.4]
Molding can improve overall throughput despite penalizing single applications 12

13 Inter-application sharing [HPDC 12]
- app1: m, cHD, k11, cpu, k12, cDH, f, cpu
- app2: cpu, m, cHD, k21, cDH, f, cpu, m, cHD, k22, cpu, k23, cDH, f
[timelines: serialized execution; GPU/CPU sharing w/o conflicting memory requirements; GPU/CPU sharing w/ conflicting memory requirements, with app1's data transferred GPU-to-CPU and back around app2's phases] 13

14 Our runtime: node-level view
Applications app1..appN each call an intercept library that talks to our runtime, which comprises:
- Memory manager: virtual memory handling, page table, swap area
- Connection manager and offload control: waiting/assigned/failed contexts, node-to-node offloading
- Dispatcher: scheduling, GPU binary registration
- Virtual GPUs (vgpu 11..vgpu nk) layered on the CUDA driver/runtime and GPUs 1..n -> abstraction and GPU sharing 14

15 Mapping and scheduling (FCFS)
[diagram: apps 1-3 with threads t1..t3 call the FE library; the runtime's connection manager holds contexts c11..c33 in the waiting queue before dispatch to vgpus 11..32 on GPUs 1-3] 15

16 Mapping and scheduling (FCFS)
[diagram: the dispatcher moves contexts c11..c32 from the waiting queue to the assigned queue, binding them to vgpus 11..32 on GPUs 1-3]
- Hardware configuration and application-to-GPU mapping are abstracted from end-users
- Time-sharing of GPUs 16
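The FCFS mapping above can be sketched as a queue-draining loop. This is a minimal illustration, not the runtime's actual dispatcher: context and vgpu names echo the slide, but the data structures and the exact assignment outcome are assumptions.

```python
# Minimal FCFS dispatcher sketch: waiting contexts are assigned in arrival
# order to free virtual GPUs (vgpus); once all vgpus are taken, the
# remaining contexts stay in the waiting queue.

from collections import deque

def fcfs_dispatch(contexts, vgpus):
    """Map context -> vgpu first-come-first-served; return (assigned, waiting)."""
    waiting = deque(contexts)
    free = deque(vgpus)
    assigned = {}
    while waiting and free:
        assigned[waiting.popleft()] = free.popleft()
    return assigned, list(waiting)

ctxs = ["c11", "c12", "c21", "c22", "c23", "c31", "c32", "c33"]
vgs = ["vgpu11", "vgpu12", "vgpu21", "vgpu22", "vgpu31", "vgpu32"]
assigned, waiting = fcfs_dispatch(ctxs, vgs)
print(waiting)  # ['c32', 'c33'] -- the last arrivals wait for a free vgpu
```

With more vgpus per GPU, the same loop naturally yields the time-sharing the slide describes, since several contexts map onto vgpus backed by one physical GPU.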

17 Delayed binding
[diagram: app1 issues malloc1, copyhd11, copyhd12, kernel1, copydh1; app2 issues malloc2, copyhd2, kernel2, copydh2; the memory manager records them in the page table and swap area before the dispatcher binds contexts c1, c2 to vgpus]
Deferral of application-GPU mapping
- Better scheduling decisions
- GPU memory allocation when needed
Memory manager in runtime
- Dynamic binding 17
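Delayed binding can be sketched as an intercept layer that records calls instead of executing them. Everything here is illustrative (class name, page-table layout, log format) and only mirrors the slide's idea: mallocs and copies are staged, and the real driver calls are issued at the first kernel launch, once a GPU has been chosen.

```python
# Sketch of delayed binding: the intercept layer records malloc and copy
# calls in a virtual page table; the real GPU allocation and transfer
# happen only at the first kernel launch, after the dispatcher picks a GPU.

class DelayedBindingRuntime:
    def __init__(self):
        self.page_table = {}   # virtual ptr -> {"size":..., "data":...}
        self.bound_gpu = None  # GPU chosen at first launch
        self.log = []          # calls actually issued to the real driver

    def malloc(self, ptr, size):
        self.page_table[ptr] = {"size": size, "data": None}  # no GPU call yet

    def copy_hd(self, ptr, data):
        self.page_table[ptr]["data"] = data  # staged in swap area, not on GPU

    def launch(self, kernel, pick_gpu):
        if self.bound_gpu is None:           # binding deferred until here
            self.bound_gpu = pick_gpu()
            for ptr, e in self.page_table.items():
                self.log.append(("cudaMalloc", ptr, e["size"]))
                if e["data"] is not None:
                    self.log.append(("cudaMemcpyHD", ptr))
        self.log.append(("cudaLaunch", kernel, self.bound_gpu))

rt = DelayedBindingRuntime()
rt.malloc("A_d", 1024)
rt.copy_hd("A_d", b"...")
assert rt.log == []            # nothing sent to the driver yet
rt.launch("matmul", pick_gpu=lambda: "GPU2")
print(rt.log[0], rt.log[-1])
```

The payoff is exactly the slide's bullet: the scheduler sees the whole waiting set before committing GPU memory to any one context.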

18 Dynamic binding & swapping
[diagram: apps 1-4 each issue malloc, copyhd, kernel calls; when GPU 2 is full, the memory manager swaps d1 out so applications with conflicting memory requirements can share]
- GPU sharing among applications with conflicting memory requirements
- Migration of applications from slower to faster GPUs
- High availability in case of GPU failure
- Load balancing in case of GPU upgrade/downgrade 18

19 Experiments: sharing & swapping
[chart: total execution time (sec) vs. fraction of CPU code, serialized execution (1 vgpu) vs. GPU sharing (4 vgpus)]
- 2 Tesla C2050 and 1 Tesla C1060 GPUs
- 36 matmul jobs w/ 5 kernel calls and varying CPU phases
Sharing increases performance by hiding CPU phases 19

20 Experiments: cluster w/ TORQUE
[chart: total and average execution time (sec) for 16, 32, and 48 jobs, comparing serialized execution, GPU sharing (4 vgpus), and GPU sharing + load balancing]
- 3-node cluster w/ 2 GPU nodes (2 Tesla C2050s and 1 C1060)
>2X performance improvement due to sharing, a further 20% due to offloading 20

21 Load Imbalance in Multi-process applications [HPDC 13]
Causes of load imbalance:
- Intrinsic load imbalance (intra-application)
- Different GPU capabilities (intra-application)
- Mismatch between number of GPUs and number of processes (inter-application)
- Synchronization among processes

22 Intra-application Imbalance

23 Inter-application Imbalance

24 Preemption Policies
Maximum idle time-driven preemption
- Preempt a context (process) whenever it does not utilize the GPU for a predefined amount of time
- PROS: easy implementation
- CONS: need to set the maximum idle time parameter
Synchronization call-driven preemption
- Preempt a context (process) whenever a collective communication or synchronization call is serviced
- PROS: no parameter setting needed
- CONS: either bookkeeping is needed (complex implementation, overhead) or unnecessary preemptions occur (e.g. when the last process enters the synchronization point)
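The first policy is simple enough to sketch directly. This is an assumed illustration (function name, threshold, and trace values are all invented): given each context's last GPU activity timestamp, preempt those idle beyond the threshold.

```python
# Sketch of maximum-idle-time-driven preemption: a context is preempted
# when it has not issued GPU work for longer than max_idle.

def contexts_to_preempt(last_activity, now, max_idle):
    """Return the contexts whose GPU has been idle for more than max_idle."""
    return sorted(ctx for ctx, t in last_activity.items() if now - t > max_idle)

# Invented trace: A1 and B1 are stalled at a synchronization point while
# A0 and B0 keep the GPUs busy.
last_gpu_call = {"A0": 9.5, "A1": 2.0, "B0": 9.8, "B1": 1.5}  # seconds
print(contexts_to_preempt(last_gpu_call, now=10.0, max_idle=5.0))
# ['A1', 'B1']
```

The CON from the slide shows up immediately: with max_idle set too high the same trace triggers no preemptions at all, so the parameter must be tuned per workload.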

25 Experiment: Node-level, intra-application imbalance
[chart: overall execution time (seconds) vs. percentage imbalance for batch scheduling, 2-way sharing, 4-way sharing, preemptive sharing, and preemptive 2-way sharing]
- Batch scheduler fails to capture intra-application imbalance
- N-way sharing hides CPU execution behind GPU execution phases of co-located processes
- Combining 2-way sharing and preemption further improves performance

26 Experiment: Node-level, inter-application imbalance
[chart: overall execution time (seconds) for workload compositions x[4] + 1x[3], 2x[4] + 2x[3], 1x[4] + 3x[3], 4x[3], comparing batch scheduling, 2-way sharing, 4-way sharing, preemptive sharing, and preemptive 2-way sharing]
- Batch scheduler causes GPU underutilization, leading to performance loss
- N-way sharing provides improvement only if the imbalance is high
- Preemptive sharing corrects the imbalance and leads to performance improvement

27 Experiment: Cluster-level
[chart: overall execution time (seconds) for jobs with different numbers of processes/app (up to 6 and 8 processes/app), comparing batch scheduling, 4-way sharing, and preemptive 2-way sharing]
- 2 nodes with 7 GPUs
- Batch scheduler is unable to schedule jobs with more processes than GPUs
- 4-way sharing and preemptive sharing lead to 25-30% and 40-45% performance improvement, respectively

28 Conclusion
Node-level runtime providing
- GPU virtualization (manageability)
- GPU sharing (utilization, latency hiding)
- Flexible scheduling (configurability)
- Dynamic binding & preemption (utilization, latency hiding)
What lies ahead
- Integration with cluster-level scheduler
- Dynamic scheduling at the cluster level
- Power efficiency considerations 28

29 Thanks
My coauthors (MU, AMD, NEC), and you all for the attention! 29

30 Understanding GPU resource utilization (cont'd)
- If # thread-blocks < # SMs: SMs are underutilized -> SPACE SHARING (e.g. blocks b11-b13 of k1 run alongside b21-b22 of k2)
- Otherwise all SMs are busy -> TIME SHARING (e.g. blocks b11-b15 of k1 interleaved with b21-b25 of k2)
- If co-scheduled thread-blocks have conflicting register/shared-memory requirements: worst case, SERIALIZED EXECUTION of thread-blocks
- Otherwise: best case, INTERLEAVED EXECUTION of thread-blocks -> latency hiding 30

31 Experiments: runtime overhead
[chart: total execution time (sec) vs. # of jobs for the bare CUDA runtime and for our runtime with 1, 2, 4, and 8 vgpus]
- 1 Tesla C2050 GPU, short-running jobs
Overhead < 10% in the worst case, and amortized through GPU sharing 31

32 HPDC 13
From single-process, single-threaded applications to multi-process/multi-threaded applications
- Challenge: synchronizations (e.g. barrier synchronizations, communication primitives) can introduce GPU underutilization
- Solution: preemptive GPU sharing 32

33 Scenario 1: Intra-application Imbalance
[diagram: three schedules of processes from apps A, B, C with sync points across GPUs 0-3: (a) batch scheduling leaves GPUs idle around sync points; (b) controlled 2-way sharing overlaps co-located processes; (c) preemptive sharing reorders preempted processes and finishes earliest] 33

34 Scenario 2: Inter-application Imbalance
[diagram: four schedules of apps A, B, C on GPUs 0-3 with vgpus: (a) batch scheduling; (b) controlled 2-way sharing; (c) preemptive sharing; (d) preemptive 2-way sharing; legend marks each app's processes, idle time, and sync points; the preemptive variants reduce idle time] 34

35 Types of swapping operations
Inter-application swapping
- Time-sharing of GPU among applications with conflicting memory requirements
Intra-application swapping
- Memory footprint of one application is the memory footprint of the largest kernel
malloc(&A_d, size); malloc(&B_d, size); malloc(&C_d, size);
copyHD(A_d, A_h, size);
matmul(A_d, A_d, B_d); // B_d = A_d * A_d
matmul(B_d, B_d, C_d); // C_d = B_d * B_d
copyDH(B_h, B_d, size); copyDH(C_h, C_d, size); 35

36 Types of swapping operations
Inter-application swapping
- Time-sharing of GPU among applications with conflicting memory requirements
Intra-application swapping
- Memory footprint of one application is the memory footprint of the largest kernel
ON THE BARE CUDA RUNTIME:
malloc(&A_d, size); malloc(&B_d, size); malloc(&C_d, size); // MEMORY CAPACITY EXCEEDED -> RUNTIME ERROR!
copyHD(A_d, A_h, size);
matmul(A_d, A_d, B_d); // B_d = A_d * A_d
matmul(B_d, B_d, C_d); // C_d = B_d * B_d
copyDH(B_h, B_d, size); copyDH(C_h, C_d, size); 36

37 Types of swapping operations
Inter-application swapping
- Time-sharing of GPU among applications with conflicting memory requirements
Intra-application swapping
- Memory footprint of one application is the memory footprint of the largest kernel
ON OUR RUNTIME:
malloc(&A_d, size); malloc(&B_d, size); malloc(&C_d, size);
copyHD(A_d, A_h, size);
matmul(A_d, A_d, B_d); // FIRST MEMORY ALLOCATION & DATA XFER TO GPU (A_d & B_d)
matmul(B_d, B_d, C_d); // SWAP(A_d) & MEMORY ALLOCATION (C_d)
copyDH(B_h, B_d, size); copyDH(C_h, C_d, size); 37
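The intra-application case on slides 35-37 can be sketched as a tiny memory manager. This is a hedged simulation (class name, capacity, and unit buffer sizes are invented): when an allocation would exceed GPU memory, a buffer that the next kernel does not need is swapped out first, as the runtime does with A_d before allocating C_d.

```python
# Sketch of intra-application swapping: when a malloc would exceed GPU
# memory, swap out a buffer the next kernel does not touch, mirroring the
# A_d/B_d/C_d matmul example above.

class SwappingMemoryManager:
    def __init__(self, capacity):
        self.capacity = capacity
        self.on_gpu = {}      # buffer -> size, currently allocated on the GPU
        self.swapped = set()  # buffers evicted to the host-side swap area

    def ensure_on_gpu(self, name, size, not_needed=()):
        """Allocate `name` on the GPU, swapping out buffers listed in
        `not_needed` (buffers the next kernel does not touch) until it fits."""
        candidates = [b for b in not_needed if b in self.on_gpu]
        while sum(self.on_gpu.values()) + size > self.capacity:
            victim = candidates.pop(0)   # IndexError here = truly out of memory
            self.swapped.add(victim)     # copy device-to-host, then cudaFree
            del self.on_gpu[victim]
        self.on_gpu[name] = size

mm = SwappingMemoryManager(capacity=2)          # room for two unit buffers
mm.ensure_on_gpu("A_d", 1)
mm.ensure_on_gpu("B_d", 1)                      # matmul(A_d, A_d, B_d) can run
mm.ensure_on_gpu("C_d", 1, not_needed=["A_d"])  # swap A_d, then allocate C_d
print(sorted(mm.on_gpu), sorted(mm.swapped))    # ['B_d', 'C_d'] ['A_d']
```

This is why the application's effective footprint shrinks to that of its largest kernel: only the buffers of the currently running kernel must be device-resident.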

38 Experiments: load balancing w/ dynamic binding
[chart: total execution time (sec) vs. # of jobs, with cpu fraction = 0 and cpu fraction = 1, comparing no load balancing vs. load balancing through dynamic binding]
- Unbalanced system: 2 Tesla C2050 and 1 Quadro 2000 GPUs
Especially on small batches of jobs, dynamic binding improves performance 38

39 Runtime configurations
Only initial memory transfer deferral
- Only memory transfers before the 1st kernel call are deferred
- Pros: overlap computation/communication
- Cons: more swapping overhead
Unconditional memory transfer deferral
- All memory transfers are deferred
- Pros: less swapping overhead
- Cons: no overlapping of computation/communication 39

40 Actions performed by the runtime for each application call, and errors returned:
- Malloc: create PTE (error: a virtual address cannot be assigned); allocate swap (error: swap memory cannot be allocated)
- CopyHD: check valid PTE (error: no valid PTE); move data to swap (error: swap-data size mismatch)
- CopyDH: check valid PTE (error: no valid PTE); if (PTE.toCopy2Swap) cudaMemcpyDH
- Free: check valid PTE (error: no valid PTE); de-allocate swap (error: cannot de-allocate swap); if (PTE.isAllocated) cudaFree
- Launch: check valid PTE (error: no valid PTE); if (!PTE.isAllocated) cudaMalloc; if (PTE.toCopy2Dev) cudaMemcpyHD; cudaLaunch
- Swap: check valid PTE (error: no valid PTE); if (PTE.toCopy2Swap) cudaMemcpyDH; if (PTE.isAllocated) cudaFree 40

41 Flags for Page Table Entry handling: isAllocated/toCopy2Dev/toCopy2Swap
[state diagram: flag states F/F/F, F/T/F, T/T/F, T/F/F, T/F/T with transitions driven by copyHD, copyDH, launch, and swap calls] 41
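The flag transitions can be sketched as a small state machine following the per-call actions on slide 40. Caveat: the exact transition semantics are inferred from the flag names, not fully stated on the slide, so treat this as a plausible reconstruction (reading the flags as: isAllocated = buffer on GPU, toCopy2Dev = swap copy newer than device, toCopy2Swap = device copy newer than swap).

```python
# Hedged reconstruction of the Page Table Entry flag handling:
# isAllocated / toCopy2Dev / toCopy2Swap, driven by copyHD, launch, swap.

class PTE:
    def __init__(self):  # after malloc: F/F/F
        self.isAllocated = self.toCopy2Dev = self.toCopy2Swap = False

    def flags(self):
        return tuple("T" if f else "F"
                     for f in (self.isAllocated, self.toCopy2Dev, self.toCopy2Swap))

    def copyHD(self):            # host data staged in swap: device copy stale
        self.toCopy2Dev, self.toCopy2Swap = True, False

    def launch(self):            # cudaMalloc if needed, cudaMemcpyHD if needed,
        self.isAllocated = True  # then run: the kernel may write device memory,
        self.toCopy2Dev = False  # so the swap copy is now stale
        self.toCopy2Swap = True

    def swap(self):              # cudaMemcpyDH if dirty, then cudaFree
        self.toCopy2Dev = self.toCopy2Swap or self.toCopy2Dev
        self.isAllocated = self.toCopy2Swap = False

pte = PTE()
print(pte.flags())  # ('F', 'F', 'F') after malloc
pte.copyHD(); print(pte.flags())  # ('F', 'T', 'F')
pte.launch(); print(pte.flags())  # ('T', 'F', 'T')
pte.swap();   print(pte.flags())  # ('F', 'T', 'F') -- data preserved in swap
```

Note how swapping returns the entry to F/T/F rather than F/F/F: the data survives in the swap area and will be re-staged to the device at the next launch, which is what makes the swap transparent to the application.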


More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Building systems with GPUs is hard. Why? 2 Goal of

More information

Introduction. CS3026 Operating Systems Lecture 01

Introduction. CS3026 Operating Systems Lecture 01 Introduction CS3026 Operating Systems Lecture 01 One or more CPUs Device controllers (I/O modules) Memory Bus Operating system? Computer System What is an Operating System An Operating System is a program

More information

Critically Missing Pieces on Accelerators: A Performance Tools Perspective

Critically Missing Pieces on Accelerators: A Performance Tools Perspective Critically Missing Pieces on Accelerators: A Performance Tools Perspective, Karthik Murthy, Mike Fagan, and John Mellor-Crummey Rice University SC 2013 Denver, CO November 20, 2013 What Is Missing in GPUs?

More information

CPU Scheduling: Objectives

CPU Scheduling: Objectives CPU Scheduling: Objectives CPU scheduling, the basis for multiprogrammed operating systems CPU-scheduling algorithms Evaluation criteria for selecting a CPU-scheduling algorithm for a particular system

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU

More information

Optimizing Efficiency of Deep Learning Workloads through GPU Virtualization

Optimizing Efficiency of Deep Learning Workloads through GPU Virtualization Optimizing Efficiency of Deep Learning Workloads through GPU Virtualization Presenters: Tim Kaldewey Performance Architect, Watson Group Michael Gschwind Chief Engineer ML & DL, Systems Group David K.

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

Recent Advances in Heterogeneous Computing using Charm++

Recent Advances in Heterogeneous Computing using Charm++ Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing

More information

CS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University

CS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University CS 571 Operating Systems Midterm Review Angelos Stavrou, George Mason University Class Midterm: Grading 2 Grading Midterm: 25% Theory Part 60% (1h 30m) Programming Part 40% (1h) Theory Part (Closed Books):

More information

Chapter 5: CPU Scheduling

Chapter 5: CPU Scheduling Chapter 5: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Thread Scheduling Multiple-Processor Scheduling Operating Systems Examples Algorithm Evaluation Chapter 5: CPU Scheduling

More information

Abstract. Testing Parameters. Introduction. Hardware Platform. Native System

Abstract. Testing Parameters. Introduction. Hardware Platform. Native System Abstract In this paper, we address the latency issue in RT- XEN virtual machines that are available in Xen 4.5. Despite the advantages of applying virtualization to systems, the default credit scheduler

More information

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU

More information

Increasing the efficiency of your GPU-enabled cluster with rcuda. Federico Silla Technical University of Valencia Spain

Increasing the efficiency of your GPU-enabled cluster with rcuda. Federico Silla Technical University of Valencia Spain Increasing the efficiency of your -enabled cluster with rcuda Federico Silla Technical University of Valencia Spain Outline Why remote virtualization? How does rcuda work? The performance of the rcuda

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 10 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Chapter 6: CPU Scheduling Basic Concepts

More information

references Virtualization services Topics Virtualization

references Virtualization services Topics Virtualization references Virtualization services Virtual machines Intel Virtualization technology IEEE xplorer, May 2005 Comparison of software and hardware techniques for x86 virtualization ASPLOS 2006 Memory resource

More information

CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA

CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA Sreepathi Pai October 18, 2017 URCS Outline Background Memory Code Execution Model Outline Background Memory Code Execution Model

More information

Subject Name:Operating system. Subject Code:10EC35. Prepared By:Remya Ramesan and Kala H.S. Department:ECE. Date:

Subject Name:Operating system. Subject Code:10EC35. Prepared By:Remya Ramesan and Kala H.S. Department:ECE. Date: Subject Name:Operating system Subject Code:10EC35 Prepared By:Remya Ramesan and Kala H.S. Department:ECE Date:24-02-2015 UNIT 1 INTRODUCTION AND OVERVIEW OF OPERATING SYSTEM Operating system, Goals of

More information

Virtualization and the Metrics of Performance & Capacity Management

Virtualization and the Metrics of Performance & Capacity Management 23 S September t b 2011 Virtualization and the Metrics of Performance & Capacity Management Has the world changed? Mark Preston Agenda Reality Check. General Observations Traditional metrics for a non-virtual

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Belay, A. et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Reviewed by Chun-Yu and Xinghao Li Summary In this

More information

Changing landscape of computing at BNL

Changing landscape of computing at BNL Changing landscape of computing at BNL Shared Pool and New Users and Tools HTCondor Week May 2018 William Strecker-Kellogg Shared Pool Merging 6 HTCondor Pools into 1 2 What? Current Situation

More information

IVM: A Task-based Shared Memory Programming Model and Runtime System to Enable Uniform Access to CPU-GPU Clusters

IVM: A Task-based Shared Memory Programming Model and Runtime System to Enable Uniform Access to CPU-GPU Clusters IVM: A Task-based Shared Memory Programming Model and Runtime System to Enable Uniform Access to CPU-GPU Clusters Kittisak Sajjapongse, Ruidong Gu, Michela Becchi University of Missouri - Dept. of Electrical

More information

The rcuda middleware and applications

The rcuda middleware and applications The rcuda middleware and applications Will my application work with rcuda? rcuda currently provides binary compatibility with CUDA 5.0, virtualizing the entire Runtime API except for the graphics functions,

More information

NLVMUG 16 maart Display protocols in Horizon

NLVMUG 16 maart Display protocols in Horizon NLVMUG 16 maart 2017 Display protocols in Horizon NLVMUG 16 maart 2017 Display protocols in Horizon Topics Introduction Display protocols - Basics PCoIP vs Blast Extreme Optimizing Monitoring Future Recap

More information

LECTURE 3:CPU SCHEDULING

LECTURE 3:CPU SCHEDULING LECTURE 3:CPU SCHEDULING 1 Outline Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time CPU Scheduling Operating Systems Examples Algorithm Evaluation 2 Objectives

More information

OPERATING SYSTEMS CS3502 Spring Processor Scheduling. Chapter 5

OPERATING SYSTEMS CS3502 Spring Processor Scheduling. Chapter 5 OPERATING SYSTEMS CS3502 Spring 2018 Processor Scheduling Chapter 5 Goals of Processor Scheduling Scheduling is the sharing of the CPU among the processes in the ready queue The critical activities are:

More information

COLLIN LEE INITIAL DESIGN THOUGHTS FOR A GRANULAR COMPUTING PLATFORM

COLLIN LEE INITIAL DESIGN THOUGHTS FOR A GRANULAR COMPUTING PLATFORM COLLIN LEE INITIAL DESIGN THOUGHTS FOR A GRANULAR COMPUTING PLATFORM INITIAL DESIGN THOUGHTS FOR A GRANULAR COMPUTING PLATFORM GOAL OF THIS TALK Introduce design ideas and issues for a granular computing

More information

CS 471 Operating Systems. Yue Cheng. George Mason University Fall 2017

CS 471 Operating Systems. Yue Cheng. George Mason University Fall 2017 CS 471 Operating Systems Yue Cheng George Mason University Fall 2017 Outline o Process concept o Process creation o Process states and scheduling o Preemption and context switch o Inter-process communication

More information

Heidi Poxon Cray Inc.

Heidi Poxon Cray Inc. Heidi Poxon Topics GPU support in the Cray performance tools CUDA proxy MPI support for GPUs (GPU-to-GPU) 2 3 Programming Models Supported for the GPU Goal is to provide whole program analysis for programs

More information

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) Past & Present Have looked at two constraints: Mutual exclusion constraint between two events is a requirement that

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

CUDA C Programming Mark Harris NVIDIA Corporation

CUDA C Programming Mark Harris NVIDIA Corporation CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

CPU Scheduling: Part I ( 5, SGG) Operating Systems. Autumn CS4023

CPU Scheduling: Part I ( 5, SGG) Operating Systems. Autumn CS4023 Operating Systems Autumn 2017-2018 Outline 1 CPU Scheduling: Part I ( 5, SGG) Outline CPU Scheduling: Part I ( 5, SGG) 1 CPU Scheduling: Part I ( 5, SGG) Basic Concepts Typical program behaviour CPU Scheduling:

More information

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 15 ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Start of Lecture: February 10, Chapter 6: Scheduling

Start of Lecture: February 10, Chapter 6: Scheduling Start of Lecture: February 10, 2014 1 Reminders Exercise 2 due this Wednesday before class Any questions or comments? 2 Scheduling so far First-Come-First Serve FIFO scheduling in queue without preempting

More information

Operating Systems. Process scheduling. Thomas Ropars.

Operating Systems. Process scheduling. Thomas Ropars. 1 Operating Systems Process scheduling Thomas Ropars thomas.ropars@univ-grenoble-alpes.fr 2018 References The content of these lectures is inspired by: The lecture notes of Renaud Lachaize. The lecture

More information

Real-Time Support for GPU. GPU Management Heechul Yun

Real-Time Support for GPU. GPU Management Heechul Yun Real-Time Support for GPU GPU Management Heechul Yun 1 This Week Topic: Real-Time Support for General Purpose Graphic Processing Unit (GPGPU) Today Background Challenges Real-Time GPU Management Frameworks

More information

CPU Scheduling. CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections )

CPU Scheduling. CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections ) CPU Scheduling CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections 6.7.2 6.8) 1 Contents Why Scheduling? Basic Concepts of Scheduling Scheduling Criteria A Basic Scheduling

More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux

More information

Automatic NUMA Balancing. Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist, HP

Automatic NUMA Balancing. Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist, HP Automatic NUMA Balancing Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist, HP Automatic NUMA Balancing Agenda What is NUMA, anyway? Automatic NUMA balancing internals

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

CS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010

CS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010 CS4961 Parallel Programming Lecture 14: Reasoning about Performance Administrative: What s Coming Programming assignment 2 due Friday, 11:59PM Homework assignment out on Tuesday, Oct. 19 and due Monday,

More information

Machine Learning on VMware vsphere with NVIDIA GPUs

Machine Learning on VMware vsphere with NVIDIA GPUs Machine Learning on VMware vsphere with NVIDIA GPUs Uday Kurkure, Hari Sivaraman, Lan Vu GPU Technology Conference 2017 2016 VMware Inc. All rights reserved. Gartner Hype Cycle for Emerging Technology

More information

Advanced CUDA Programming. Dr. Timo Stich

Advanced CUDA Programming. Dr. Timo Stich Advanced CUDA Programming Dr. Timo Stich (tstich@nvidia.com) Outline SIMT Architecture, Warps Kernel optimizations Global memory throughput Launch configuration Shared memory access Instruction throughput

More information

Introduction to parallel computing

Introduction to parallel computing Introduction to parallel computing 3. Parallel Software Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Last time Parallel hardware Multi-core

More information

Chap 7, 8: Scheduling. Dongkun Shin, SKKU

Chap 7, 8: Scheduling. Dongkun Shin, SKKU Chap 7, 8: Scheduling 1 Introduction Multiprogramming Multiple processes in the system with one or more processors Increases processor utilization by organizing processes so that the processor always has

More information

The MOSIX Algorithms for Managing Cluster, Multi-Clusters, GPU Clusters and Clouds

The MOSIX Algorithms for Managing Cluster, Multi-Clusters, GPU Clusters and Clouds The MOSIX Algorithms for Managing Cluster, Multi-Clusters, GPU Clusters and Clouds Prof. Amnon Barak Department of Computer Science The Hebrew University of Jerusalem http:// www. MOSIX. Org 1 Background

More information

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large

More information

Lecture Topics. Announcements. Today: Uniprocessor Scheduling (Stallings, chapter ) Next: Advanced Scheduling (Stallings, chapter

Lecture Topics. Announcements. Today: Uniprocessor Scheduling (Stallings, chapter ) Next: Advanced Scheduling (Stallings, chapter Lecture Topics Today: Uniprocessor Scheduling (Stallings, chapter 9.1-9.3) Next: Advanced Scheduling (Stallings, chapter 10.1-10.4) 1 Announcements Self-Study Exercise #10 Project #8 (due 11/16) Project

More information

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2018 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 What is an Operating System? What is

More information

Introduction to CELL B.E. and GPU Programming. Agenda

Introduction to CELL B.E. and GPU Programming. Agenda Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU

More information