Design of a Virtualization Framework to Enable GPU Sharing in Cluster Environments
1 Design of a Virtualization Framework to Enable GPU Sharing in Cluster Environments
Michela Becchi, University of Missouri, nps.missouri.edu
2 GPUs in Clusters & Clouds
Many-core GPUs are used in supercomputers
- 3 out of the top 10 supercomputers use GPUs
- Titan: > 20 petaflops, > 700 terabytes of memory
- 18,688 nodes, each with a 16-core AMD CPU and 1 Nvidia Tesla K20 GPU
Many-core GPUs are used in cloud computing
3 Different usage paradigms
Accelerator model:
- 1 application
- GPU: dedicated resource
- Explicit procurement of GPUs
- Static (or programmer-defined) binding of application to GPUs
- Intra-application scheduling
- Memory management within the application
Cluster/cloud model:
- Multi-tenancy
- GPU: shared resource
- Resource virtualization & transparency
- Dynamic (or runtime) binding of applications to GPUs, for better resource utilization and load balancing
- Intra- and inter-application scheduling
- Advanced memory management across applications required
4 Context
[Figure: GPU-accelerated applications deployed across shared cluster nodes: AMBER, GROMACS, NAMD, GPUBlast, LAMMPS]
5 We have designed a runtime that:
- Abstracts GPUs from end-users
- Schedules applications on GPUs
- Dynamically binds applications to GPUs
- Allows GPU sharing
- Provides memory management
- Provides dynamic recovery and load balancing in case of GPU failure/upgrade/downgrade
6 Deployment scenarios
With cluster-level schedulers (e.g., TORQUE, SLURM):
- The cluster-level scheduler dispatches CUDA applications to the nodes
- On each node, every CUDA application loads an intercept library; our runtime sits between the applications and the OS with the CUDA GPU driver/runtime, which manages GPUs 1 through n
With VM-based systems for cloud computing (e.g., Eucalyptus):
- Each VM runs CUDA applications with the intercept library on top of its guest OS
- Our runtime runs on the host OS, below the VM manager and above the CUDA GPU driver/runtime, which manages GPUs 1 through n
A sketch of such an intercept library appears after this list.
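The slides do not show the intercept library's code. As a hedged illustration, this minimal sketch (an assumed design, not the authors' implementation) interposes on cudaMalloc via LD_PRELOAD; the logging call stands in for the point where a request could be forwarded to a node-level runtime before the real CUDA runtime is invoked.

```c
/* Minimal sketch of a CUDA API intercept library (assumed design).
 * Build as a shared object and activate with LD_PRELOAD so unmodified
 * CUDA applications pass through it. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <cuda_runtime_api.h>

/* Pointer to the real cudaMalloc, resolved lazily from the CUDA runtime. */
static cudaError_t (*real_cudaMalloc)(void **, size_t) = NULL;

cudaError_t cudaMalloc(void **devPtr, size_t size)
{
    if (!real_cudaMalloc)
        real_cudaMalloc =
            (cudaError_t (*)(void **, size_t))dlsym(RTLD_NEXT, "cudaMalloc");

    /* Hypothetical hook: here the shim would notify the node-level runtime
     * so it can account for GPU memory and defer or redirect the request. */
    fprintf(stderr, "[intercept] cudaMalloc(%zu bytes)\n", size);

    return real_cudaMalloc(devPtr, size);
}
```

Compiled with something like `gcc -shared -fPIC intercept.c -o libintercept.so -ldl` and preloaded, such a shim lets the runtime manage applications without modifying or recompiling them.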
7 GPU sharing
Inter-kernel sharing [HPDC 11]
- When: GPU underutilized within a kernel
- Why: limited parallelism, small datasets
- How: kernel consolidation across applications
Inter-application sharing [HPDC 12]
- When: GPU underutilized within an application
- Why: long CPU phases
- How: application multiplexing on the GPU
[Timelines: consolidated kernels k1 and k2 sharing the GPU; app1 and app2 alternating between CPU and GPU phases over time]
8 GPU sharing
Multi-process application sharing [HPDC 13]
- When: GPU underutilized by multi-process applications (e.g., MPI)
- Why: synchronization leads to intra- & inter-application imbalance
- How: preempt some inactive processes to allow other processes to progress
[Timelines: processes A0, A1, B0, B1 on GPU 0 and GPU 1 over time, before and after preemptive sharing]
9 Inter-kernel sharing [HPDC 11]
Each application goes through malloc (m), host-to-device copy (c HD), kernel (k), device-to-host copy (c DH), and free (f).
- Serialized execution: app1's sequence (m, c HD, k1, c DH, f) runs before app2's (m, c HD, k2, c DH, f)
- Inter-kernel sharing: both mallocs and host-to-device copies are performed, then k1 and k2 are combined into a single kernel, followed by the two device-to-host copies and frees
app1 and app2 have no conflicting memory requirements. A stream-based intuition sketch follows.
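The consolidation mechanism itself is not shown on the slide. As a rough intuition for why two small kernels can share one GPU, this standalone CUDA sketch (streams-based, explicitly not the paper's kernel-consolidation technique) overlaps two under-sized kernels on the same device.

```c
// Illustrative only: two small kernels from independent work overlapped on
// one GPU via CUDA streams. The paper instead consolidates kernels across
// applications into a single launch; this just shows why a GPU left
// underutilized by one small kernel can absorb another.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void k1(float *a) { a[threadIdx.x] += 1.0f; }
__global__ void k2(float *b) { b[threadIdx.x] *= 2.0f; }

int main(void)
{
    float *a, *b;
    cudaMalloc(&a, 256 * sizeof(float));
    cudaMalloc(&b, 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Each kernel uses only one thread block, far fewer than the GPU's SMs,
    // so the two launches can execute concurrently.
    k1<<<1, 256, 0, s1>>>(a);
    k2<<<1, 256, 0, s2>>>(b);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    printf("done\n");
    return 0;
}
```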
10 Space- vs. time-sharing: some results
[Chart: relative throughput benefit of space-sharing and time-sharing across four batches on GPU1 and GPU2, for workload mixes BS+KM, BO+KNN, PDE+MD, EU+IP, BS+BO, KM+KNN, BO+EU, BS+MD]
11 Idea: Molding
Downgrade the execution configuration of kernels so as to force beneficial sharing
- Penalize a single application to improve overall throughput
- Limiting the number of blocks forces space sharing
- Limiting the number of threads per block forces time sharing with interleaved execution
Example: kernel 1 (blocks b11..b14) downgraded to 3 blocks and kernel 2 (blocks b21..b24) downgraded to 2 blocks can space-share the GPU after molding. A kernel-side sketch of molding follows.
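The slide describes molding abstractly. The following CUDA sketch shows one plausible way a kernel could tolerate a downgraded block count, using a grid-stride loop over logical blocks; this is my assumption of the mechanics, not the authors' implementation.

```c
/* Sketch of "molding" (assumed, simplified): instead of launching a kernel
 * with its natural grid of N blocks, the runtime launches fewer blocks and
 * each physical block loops over several logical block indices, freeing SMs
 * for a co-scheduled kernel. */
__global__ void molded_kernel(float *data, int n, int logical_blocks)
{
    /* Each physical block processes several logical blocks. */
    for (int lb = blockIdx.x; lb < logical_blocks; lb += gridDim.x) {
        int i = lb * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;    /* placeholder per-element work */
    }
}

/* Natural launch:
 *   molded_kernel<<<logical_blocks, 256>>>(d_data, n, logical_blocks);
 * Molded launch, downgraded to 3 blocks to force space-sharing:
 *   molded_kernel<<<3, 256>>>(d_data, n, logical_blocks);
 */
```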
12 Molding: some results
[Chart: relative throughput benefit with and without molding on GPU1 and GPU2, for forced time-sharing (IP+BS, PDE+MD) and forced space-sharing (IP+BO, BS+KM) workload mixes]
Molding can improve overall throughput despite penalizing single applications
13 Inter-application sharing [HPDC 12]
- app1: malloc, host-to-device copy, kernel k11, CPU phase, kernel k12, device-to-host copy, free, CPU phase
- app2: CPU phase, malloc, host-to-device copy, kernel k21, device-to-host copy, free, CPU phase, malloc, host-to-device copy, kernel k22, CPU phase, kernel k23, device-to-host copy, free
[Timelines: serialized execution; GPU/CPU sharing without conflicting memory requirements; GPU/CPU sharing with conflicting memory requirements, which adds GPU-to-CPU and CPU-to-GPU transfers for app1]
14 Our runtime: node-level view
Applications app1 .. appN each load the intercept library and talk to our runtime, which contains:
- Memory manager: virtual memory handling, page table, swap area
- Connection manager and offload control: waiting, assigned, and failed contexts; node-to-node offloading
- Dispatcher: scheduling, GPU binary registration
- Virtual GPUs (vgpu 11 .. vgpu nk): an abstraction layer over the CUDA driver/runtime and physical GPUs 1..n that enables GPU sharing
A sketch of the per-node bookkeeping these components imply follows.
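As a hedged illustration of the bookkeeping this architecture implies, the following C sketch (all names and fields are assumptions, not the authors' data structures) models contexts moving among waiting, assigned, and failed lists, plus virtual GPUs backed by physical GPUs.

```c
/* Hypothetical sketch of the node-level runtime's bookkeeping. */
#include <stddef.h>

typedef enum { CTX_WAITING, CTX_ASSIGNED, CTX_FAILED } ctx_state_t;

typedef struct context {
    int             app_id;    /* owning application */
    ctx_state_t     state;     /* waiting / assigned / failed */
    int             vgpu_id;   /* virtual GPU it is bound to, -1 if none */
    struct context *next;      /* intrusive list link used by the dispatcher */
} context_t;

typedef struct {
    int        gpu_id;         /* physical GPU backing this virtual GPU */
    context_t *current;        /* context currently using it, if any */
} vgpu_t;

typedef struct {
    context_t *waiting, *assigned, *failed;  /* dispatcher queues */
    vgpu_t    *vgpus;                        /* n GPUs x k vGPUs each */
    size_t     num_vgpus;
} runtime_state_t;
```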
15 Mapping and scheduling (FCFS)
[Figure: app1 (threads t1, t2), app2 (t1..t3), and app3 (t1..t3) issue contexts c11..c33 through the front-end library; the dispatcher holds them in the waiting-context list before binding them to virtual GPUs vgpu11..vgpu32 backed by GPUs 1..3]
16 Mapping and scheduling (FCFS)
[Figure: the dispatcher moves contexts from the waiting list to the assigned list and binds them to virtual GPUs in FCFS order]
- Hardware configuration and application-GPU mapping are abstracted from end-users
- Time-sharing of GPUs
17 Delayed binding
app1 issues malloc, copyHD, copyHD, kernel, copyDH; app2 issues malloc, copyHD, kernel, copyDH. The memory manager records their data (d1, d2) in the page table and swap area before any GPU is chosen.
- Deferring the application-GPU mapping enables better scheduling decisions
- GPU memory is allocated only when needed
- The memory manager in the runtime enables dynamic binding
A sketch of such deferred allocation at the intercept layer follows.
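A minimal sketch of what delayed binding could look like at the intercept layer, assuming, as the slide suggests, that allocations and transfers are recorded in a page table and a host-side swap area instead of touching the GPU immediately; names such as rt_malloc and pte_t are illustrative, not the authors' API.

```c
/* Sketch of delayed binding (assumed behavior): cudaMalloc is not forwarded
 * to the GPU; the runtime only records a page-table entry and allocates a
 * host-side swap buffer. The real device allocation and transfer happen
 * lazily at kernel launch, once a GPU has been chosen. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    void  *swap;    /* host-side staging copy of the data */
    void  *dev;     /* device pointer, NULL until bound to a GPU */
    size_t size;
    int    dirty;   /* swap copy newer than device copy */
} pte_t;

/* Intercepted "malloc": create a PTE and a swap buffer, no GPU touched yet. */
pte_t *rt_malloc(size_t size)
{
    pte_t *p = calloc(1, sizeof(*p));
    p->swap = malloc(size);
    p->size = size;
    return p;
}

/* Intercepted host-to-device "copy": stage the data into the swap area. */
void rt_copy_hd(pte_t *p, const void *host_src)
{
    memcpy(p->swap, host_src, p->size);
    p->dirty = 1;
}
```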
18 Dynamic binding & swapping
[Figure: app1..app4 each issue malloc, copyHD, and kernel calls; when GPU 1 fills up, the memory manager swaps app1's data (d1) out to the swap area so that app4's data (d4) can be placed, while contexts are rebound across vgpu11..vgpu22 on GPU 1 and GPU 2]
- GPU sharing among applications with conflicting memory requirements
- Migration of applications from slower to faster GPUs
- High availability in case of GPU failure
- Load balancing in case of GPU upgrade/downgrade
A swap-out sketch follows.
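A sketch of the swap-out step under the same assumptions as the earlier pte_t sketch: when a GPU fills up, a victim buffer is copied back to its host-side swap area and freed, so another application's data can be placed, or so the context can later be rebound to a different GPU. The function name is illustrative.

```c
#include <cuda_runtime_api.h>
#include <stddef.h>

/* Same shape as the pte_t in the delayed-binding sketch. */
typedef struct { void *swap, *dev; size_t size; int dirty; } pte_t;

int rt_swap_out(pte_t *victim)
{
    if (victim->dev == NULL)
        return 0;                           /* nothing resident on the GPU */

    /* Preserve the device copy in the host swap area before releasing it. */
    if (cudaMemcpy(victim->swap, victim->dev, victim->size,
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        return -1;

    cudaFree(victim->dev);
    victim->dev   = NULL;
    victim->dirty = 1;    /* the swap copy is now the authoritative one */
    return 0;
}
```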
19 Experiments: sharing & swapping
[Chart: total execution time (sec) vs. fraction of CPU code, serialized execution (1 vgpu) vs. GPU sharing (4 vgpus)]
- 2 Tesla C2050 and 1 Tesla C1060 GPUs
- 36 matmul jobs with 5 kernel calls each and varying CPU phases
- Sharing increases performance by hiding CPU phases
20 Experiments: cluster with TORQUE
[Chart: total and average execution time (sec) for 16, 32, and 48 jobs under serialized execution, GPU sharing (4 vgpus), and GPU sharing + load balancing]
- 3-node cluster with 2 GPU nodes (2 Tesla C2050s and 1 C1060)
- >2x performance improvement due to sharing, a further 20% due to offloading
21 Load imbalance in multi-process applications [HPDC 13]
Causes of load imbalance:
- Intrinsic load imbalance (intra-application)
- Different GPU capabilities (intra-application)
- Unmatched number of GPUs and processes (inter-application)
- Synchronization among processes
22 Intra-application Imbalance
23 Inter-application Imbalance
24 Preemption policies
Maximum idle time-driven preemption
- Preempt a context (process) whenever it does not utilize the GPU for a predefined amount of time
- Pros: easy implementation
- Cons: the maximum idle time parameter must be tuned
Synchronization call-driven preemption
- Preempt a context (process) whenever a collective communication or synchronization call is serviced
- Pros: no parameters to set
- Cons: either bookkeeping is required (complex implementation, overhead), or unnecessary preemptions occur (e.g., when the last process enters the synchronization point)
A sketch of the idle-time-driven policy follows.
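A minimal sketch of the maximum idle time-driven policy, assuming the dispatcher timestamps each context's last GPU activity; the threshold constant is exactly the parameter the slide lists as a drawback, and all names are illustrative.

```c
/* Sketch of maximum idle time-driven preemption (assumed logic): a periodic
 * check preempts any assigned context that has been idle longer than a
 * configurable threshold, so that a waiting context can take its vGPU. */
#include <time.h>

#define MAX_IDLE_SECONDS 2.0   /* tunable parameter, the policy's main drawback */

typedef struct {
    time_t last_gpu_activity;  /* updated on every kernel launch / memcpy */
    int    preempted;
} sched_ctx_t;

/* Called periodically by the dispatcher for every assigned context. */
void check_idle_preemption(sched_ctx_t *ctx, int waiting_contexts)
{
    double idle = difftime(time(NULL), ctx->last_gpu_activity);

    /* Only preempt if another context is actually waiting for the GPU. */
    if (waiting_contexts > 0 && idle > MAX_IDLE_SECONDS && !ctx->preempted) {
        ctx->preempted = 1;    /* swap its data out and release the vGPU */
    }
}
```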
25 Experiment: node-level (intra-application imbalance)
[Chart: overall execution time (seconds) vs. percentage imbalance, for batch scheduling, 2-way sharing, 4-way sharing, preemptive sharing, and preemptive 2-way sharing]
- The batch scheduler fails to capture intra-application imbalance
- N-way sharing hides CPU execution behind the GPU execution phases of co-located processes
- Combining 2-way sharing and preemption further improves performance
26 Experiment: node-level (inter-application imbalance)
[Chart: overall execution time (seconds) for workload compositions x[4] + 1x[3], 2x[4] + 2x[3], 1x[4] + 3x[3], and 4x[3], under batch scheduling, 2-way sharing, 4-way sharing, preemptive sharing, and preemptive 2-way sharing]
- The batch scheduler causes GPU underutilization, leading to performance loss
- N-way sharing provides improvement only if the imbalance is high
- Preemptive sharing corrects the imbalance and leads to performance improvement
27 Experiment: cluster-level
[Chart: overall execution time (seconds) for different numbers of processes per application (including 6 and 8) under batch scheduling, 4-way sharing, and preemptive 2-way sharing]
- 2 nodes with 7 GPUs
- The batch scheduler is unable to schedule jobs with more processes than GPUs
- 4-way sharing and preemptive sharing lead to 25-30% and 40-45% performance improvement, respectively
28 Conclusion
Node-level runtime providing:
- GPU virtualization (manageability)
- GPU sharing (utilization, latency hiding)
- Flexible scheduling (configurability)
- Dynamic binding & preemption (utilization, latency hiding)
What lies ahead:
- Integration with cluster-level schedulers
- Dynamic scheduling at the cluster level
- Power efficiency considerations
29 Thanks
My coauthors (MU, AMD, NEC), and you all for your attention!
30 Understanding GPU resource utilization (cont'd)
If a kernel's number of thread blocks is smaller than the number of SMs, some SMs sit idle (SM underutilization), and a co-scheduled kernel can fill them:
- Time sharing: thread blocks of k1 and k2 interleave on the same SMs over time
- Space sharing: thread blocks of k1 and k2 occupy different SMs, keeping all SMs busy
If co-scheduled thread blocks have no conflicting register/shared-memory requirements, their execution is interleaved and latencies are hidden (best case); otherwise thread-block execution is serialized (worst case).
A small decision sketch follows.
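A small decision sketch of the logic above, with illustrative names and thresholds that are not from the paper: two kernels space-share when their combined thread blocks fit on the available SMs, and otherwise fall back to time-sharing.

```c
/* Sketch of the sharing decision illustrated on the slide (assumed logic). */
typedef enum { SPACE_SHARING, TIME_SHARING } sharing_mode_t;

sharing_mode_t pick_sharing_mode(int blocks_k1, int blocks_k2, int num_sms)
{
    /* If one kernel leaves SMs idle and the other kernel's blocks fit
     * alongside it, the two kernels can space-share the GPU. */
    if (blocks_k1 < num_sms && blocks_k1 + blocks_k2 <= num_sms)
        return SPACE_SHARING;

    /* Otherwise interleave (or serialize) their thread blocks over time. */
    return TIME_SHARING;
}
```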
31 Experiments: runtime overhead
[Chart: total execution time (sec) vs. number of jobs, for the bare CUDA runtime and for 1, 2, 4, and 8 vgpus]
- 1 Tesla C2050 GPU, short-running jobs
- Overhead < 10% in the worst case, and amortized through GPU sharing
32 HPDC 13
From single-process, single-threaded applications to multi-process/multi-threaded applications
- Challenge: synchronizations (e.g., barrier synchronizations, communication primitives) can introduce GPU underutilization
- Solution: preemptive GPU sharing
33 Scenario 1: Intra-application imbalance
[Gantt charts comparing (a) batch scheduling, (b) controlled 2-way sharing, and (c) preemptive sharing of processes from applications A, B, and C across GPUs 0-3, with their synchronization points]
34 Scenario 2: Inter-application imbalance
[Gantt charts comparing (a) batch scheduling, (b) controlled 2-way sharing, (c) preemptive sharing, and (d) preemptive 2-way sharing of processes from applications A, B, and C across GPUs 0-3 and their vGPUs, with idle time and synchronization points marked]
35 Types of swapping operations
Inter-application swapping
- Time-sharing of a GPU among applications with conflicting memory requirements
Intra-application swapping
- The memory footprint of one application is the memory footprint of its largest kernel

    malloc(&A_d, size);
    malloc(&B_d, size);
    malloc(&C_d, size);
    copyHD(A_d, A_h, size);
    matmul(A_d, A_d, B_d);   // B_d = A_d * A_d
    matmul(B_d, B_d, C_d);   // C_d = B_d * B_d
    copyDH(B_h, B_d, size);
    copyDH(C_h, C_d, size);
36 Types of swapping operations (cont'd)
On the bare CUDA runtime:

    malloc(&A_d, size);
    malloc(&B_d, size);
    malloc(&C_d, size);      // MEMORY CAPACITY EXCEEDED: RUNTIME ERROR!
    copyHD(A_d, A_h, size);
    matmul(A_d, A_d, B_d);   // B_d = A_d * A_d
    matmul(B_d, B_d, C_d);   // C_d = B_d * B_d
    copyDH(B_h, B_d, size);
    copyDH(C_h, C_d, size);
37 Types of swapping operations (cont'd)
On our runtime:

    malloc(&A_d, size);
    malloc(&B_d, size);
    malloc(&C_d, size);
    copyHD(A_d, A_h, size);
    matmul(A_d, A_d, B_d);   // first memory allocation & data transfer to GPU (A_d & B_d)
    matmul(B_d, B_d, C_d);   // SWAP(A_d) & memory allocation (C_d)
    copyDH(B_h, B_d, size);
    copyDH(C_h, C_d, size);
38 Experiments: load balancing with dynamic binding
[Chart: total execution time (sec) vs. number of jobs, with and without load balancing through dynamic binding, for CPU fraction = 0 and CPU fraction = 1]
- Unbalanced system: 2 Tesla C2050 and 1 Quadro 2000 GPUs
- Especially on small batches of jobs, dynamic binding improves performance
39 Runtime configurations
Only initial memory transfer deferral
- Only memory transfers before the 1st kernel call are deferred
- Pros: overlaps computation and communication
- Cons: more swapping overhead
Unconditional memory transfer deferral
- All memory transfers are deferred
- Pros: less swapping overhead
- Cons: no overlapping of computation and communication
40 Application call: actions performed by the runtime (errors returned by the runtime)
Malloc:
- Create PTE (error: a virtual address cannot be assigned)
- Allocate swap (error: swap memory cannot be allocated)
Copy HD:
- Check valid PTE (error: no valid PTE)
- Move data to swap (error: swap-data size mismatch)
Copy DH:
- Check valid PTE (error: no valid PTE)
- If (PTE.toCopy2Swap) cudaMemcpy DH
Free:
- Check valid PTE (error: no valid PTE)
- De-allocate swap (error: cannot de-allocate swap)
- If (PTE.isAllocated) cudaFree
Launch:
- Check valid PTE (error: no valid PTE)
- If (!PTE.isAllocated) cudaMalloc
- If (PTE.toCopy2Dev) cudaMemcpy HD
- cudaLaunch
Swap:
- Check valid PTE (error: no valid PTE)
- If (PTE.toCopy2Swap) cudaMemcpy DH
- If (PTE.isAllocated) cudaFree
A sketch of the Launch path follows.
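A sketch of the Launch row of this list, assuming a PTE that carries the three flags shown on the next slide; the struct layout and function name are illustrative rather than the authors' code.

```c
/* Sketch of the Launch path (assumed implementation): on a kernel launch,
 * every PTE the kernel references is validated, the device buffer is
 * allocated if needed, pending swap data is pushed to the device, and only
 * then is the kernel launched. */
#include <cuda_runtime_api.h>
#include <stddef.h>

typedef struct {
    void  *swap, *dev;
    size_t size;
    int    isAllocated;   /* device buffer currently allocated */
    int    toCopy2Dev;    /* swap copy must be pushed to the device */
    int    toCopy2Swap;   /* device copy must be pulled back to swap */
} pte_t;

/* Prepare one argument's PTE before launching; returns 0 on success. */
int rt_prepare_for_launch(pte_t *pte)
{
    if (pte == NULL)
        return -1;                          /* "no valid PTE" */

    if (!pte->isAllocated) {
        if (cudaMalloc(&pte->dev, pte->size) != cudaSuccess)
            return -1;                      /* may trigger a swap of another PTE */
        pte->isAllocated = 1;
    }
    if (pte->toCopy2Dev) {
        if (cudaMemcpy(pte->dev, pte->swap, pte->size,
                       cudaMemcpyHostToDevice) != cudaSuccess)
            return -1;
        pte->toCopy2Dev = 0;
    }
    return 0;   /* the runtime then issues the actual kernel launch */
}
```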
41 Flags for page table entry handling
Each PTE carries three flags: isAllocated / toCopy2Dev / toCopy2Swap.
[State diagram: the flag triple moves among the states F/F/F, F/T/F, T/T/F, T/F/F, and T/F/T as copyHD, copyDH, launch, and swap operations are applied to the entry]