April 4-7, 2016, Silicon Valley
REAL PERFORMANCE RESULTS WITH VMWARE HORIZON AND VIEWPLANNER
Manvender Rawat, NVIDIA
Jason K. Lee, NVIDIA
Uday Kurkure, VMware Inc.
AGENDA
- Overview of VMware Horizon 7 and NVIDIA GRID 2.0
- Overview of VMware View Planner
- Blast Protocol Performance and Scaling Results with Knowledge Worker Workloads
- Blast Extreme (GPU) vs. Blast Extreme (CPU) vs. PCoIP
INTRODUCTION
VMWARE HORIZON WITH NVIDIA GRID
HOW DOES NVIDIA GRID WORK?
[Diagram, top to bottom]
- Virtual PCs, each with the NVIDIA Graphics Driver and a vGPU
- Virtual Workstations, each with the NVIDIA Quadro Driver and a vGPU
- Hardware virtualization layer: hypervisor with the NVIDIA GRID vGPU manager
- Server: CPUs and NVIDIA GPUs (with H.264 encode)
HOW IT WORKS TODAY: PCoIP
[Diagram: server with GRID GPU connected to the client over an IP network. Labels: CPU, NIC, Kybd/Mse, Decode, Render, Encode, Render, Capture. Legend: GRID GPU workload, non-GPU workload]
NVIDIA BLAST EXTREME ACCELERATION
[Diagram: server with GRID GPU connected to the client over an IP network. Labels: CPU, NIC, Kybd/Mse, Decode, Render, Encode, Render, Capture. Legend: GRID GPU workload, non-GPU workload]
CPU BASED CAPTURE & ENCODE PIPELINE
[Diagram: pipeline stages across CPU, GPU, and CPU lanes]
Stage labels: Load app; Execute CPU workload; Load GPU data in FB; Execute GPU workload; Transfer output to sys-mem; Display; Capture image; Transfer to sys-mem; Encode; Packetize & transmit
- Increased CPU workload
- Limited scalability
- Multiple memory transfers
GPU BASED CAPTURE & ENCODE PIPELINE
[Diagram: pipeline stages across CPU and GPU lanes; the GPU stages repeat once per overlapping frame]
Stage labels: Load app; Execute CPU workload; Load GPU data in FB; Execute GPU workload; Display; Capture; Encode; Packetize & transmit
- CPU workload offloaded to GPU
- Increased scalability
- Reduced memory transfers
CHALLENGES IN PERFORMANCE BENCHMARKING
- Selection of workloads/applications
- Automation
- Performance metrics
- Scaling
BENCHMARKING FRAMEWORK: VIEWPLANNER
- Simplicity: ease of use, simple web interface
- Expandability: easily add new workloads
- Elasticity: ease of scaling with View and ViewPlanner
BENCHMARKING WITH VIEWPLANNER
1. Select the workload applications.
2. Provision the desired number of desktop virtual machines with View and ViewPlanner.
3. Automatically launch the Horizon clients to connect to the desktops.
4. Automatically start the workload on each desktop VM.
5. Measure the response times on the remote clients.
6. Analyze the response times and resource utilization.
7. Run the scaling experiments.
VMWARE VIEWPLANNER
USER EXPERIENCE AND RESOURCE UTILIZATION
User experience in ViewPlanner is defined by:
- Frames per second
- Response times
Measuring resource utilization:
- nvidia-smi: GPU utilization
- Built-in VMware vSphere tools: CPU utilization, memory usage, network statistics, I/O statistics
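As an illustration of collecting the GPU utilization metric, the sketch below parses CSV output from `nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv,noheader,nounits`. It is not part of ViewPlanner; the function names and the sample values are hypothetical, and it assumes every queried field is reported as a number.

```python
import subprocess

# Fields requested from nvidia-smi; both are standard --query-gpu properties.
QUERY_FIELDS = "utilization.gpu,utilization.memory"


def parse_gpu_utilization(csv_text):
    """Parse 'gpu_util, mem_util' rows (one row per GPU) into a list of dicts."""
    samples = []
    for line in csv_text.strip().splitlines():
        gpu_util, mem_util = (float(field) for field in line.split(","))
        samples.append({"gpu_util_pct": gpu_util, "mem_util_pct": mem_util})
    return samples


def sample_gpu_utilization():
    """Run nvidia-smi once and return one sample per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_utilization(result.stdout)


# Hypothetical output for a host with two GPUs (not measured data).
example_output = "37, 12\n41, 15\n"
print(parse_gpu_utilization(example_output))
```

Calling `sample_gpu_utilization()` at a fixed interval during a run would give a time series comparable to a GPU utilization chart.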
PERFORMANCE METRICS MEASUREMENT
Phases: ramp up, steady state, ramp down.
For accurate results, scores are computed over the steady-state range only. Results from the ramp-up and ramp-down iterations are excluded.
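The steady-state rule can be sketched as follows. This is a minimal illustration, not ViewPlanner's implementation: the function name is hypothetical, the number of ramp-up and ramp-down iterations (one each) is an assumption, and the scores are made-up values.

```python
from statistics import mean


def steady_state_score(iteration_scores, ramp_up=1, ramp_down=1):
    """Average per-iteration scores after dropping ramp-up and ramp-down iterations."""
    if ramp_up < 0 or ramp_down < 0:
        raise ValueError("ramp_up and ramp_down must be non-negative")
    steady = iteration_scores[ramp_up:len(iteration_scores) - ramp_down]
    if not steady:
        raise ValueError("no steady-state iterations left to score")
    return mean(steady)


# Hypothetical per-iteration response times in seconds (not measured data).
scores = [2.0, 1.0, 1.0, 1.0, 3.0]
print(steady_state_score(scores))
```

Here the first and last iterations, which are slower because VMs are still logging in or finishing, do not affect the reported score.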
PARTNERS AND CUSTOMERS USING VIEWPLANNER
KNOWLEDGE WORKLOAD TEST RESULTS
NVIDIA TEST SETUP
- Virtual client VMs: 64-bit Win7 (SP1), 4 vCPU, 4 GB RAM, View Client 4.0
- Remote display protocol: Blast Extreme / PCoIP
- Virtual VDI desktop VMs: 64-bit Win7 (SP1), 6 vCPU, 14 GB RAM, 50 GB HD, Horizon View 7.0 agent
- Storage
- SuperMicro SYS-2027GR-TRFH: Intel Xeon E5-2690 v2 @ 3.00 GHz (Intel Ivy Bridge), 20 cores (2 x 10-core socket), 256 GB RAM, 2 x NVIDIA GRID K1
- SuperMicro SYS-2028GR-TRT: Intel Xeon E5-2698 v3 @ 2.30 GHz (Intel Haswell), 32 cores (2 x 16-core socket), 256 GB RAM, 2 x NVIDIA GRID M60
ADOBE PHOTOSHOP OPENGL WORKLOAD OVERVIEW
ADOBE PHOTOSHOP OPENGL WORKLOAD
- Scaling from 1 VM to 48 VMs
- 3D-intensive app
AUTOCAD BENCHMARK USER EXPERIENCE METRIC
- User experience is taken to be FPS on our NVIDIA AutoCAD benchmark. This is the only measurement at the moment.
- For AutoCAD, anything higher than 20 FPS is very good, but users generally don't notice the difference once you exceed 30 FPS.
- Once you drop below 10 FPS, the software feels very sluggish, and it becomes unusable by the time you hit 5 FPS.
- Summary: 20 FPS and above is good (Autodesk claims this is the minimum UX threshold); below 10 FPS is sluggish; 5 FPS is unusable.
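The thresholds above can be written as a small classifier. This is only a sketch: the function name is hypothetical, and the label for the 10 to 20 FPS band is an assumption, since the slide does not name that range.

```python
def rate_autocad_fps(fps):
    """Map an average FPS value to the user-experience bands quoted for AutoCAD."""
    if fps <= 5:
        return "unusable"
    if fps < 10:
        return "sluggish"
    if fps < 20:
        # Assumption: the slide gives no label for 10-20 FPS.
        return "below good threshold"
    return "good"


# 36.81 is an average FPS value reported in the AutoCAD results of this deck.
print(rate_autocad_fps(36.81))
```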
AUTOCAD WORKLOAD HOST UTILIZATION
[Chart: host CPU utilization (%, 0 to 100) over time from 23:10:57 to 0:13:51; series: nvenc, pcoip; lower is better]
- Host CPU utilization, NVEnc vs PCoIP: totals of 10913 vs 10570, very similar.
- NVEnc encoder: the AutoCAD benchmark does not involve rapidly moving pixels or a large number of pixels on the screen, so the NVEnc encoder was not fully utilized (around 50% during the whole benchmark).
- In both cases, the Blast Extreme (NVEnc, GPU) and PCoIP hosts show similar host CPU utilization.
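To put "very similar" into a number, the relative difference between the two totals can be computed directly. The sketch assumes the first total (10913) belongs to NVEnc and the second (10570) to PCoIP, following the order in which they are listed; the function name is hypothetical.

```python
def relative_difference_pct(value, baseline):
    """Return how far value is above the baseline, as a percentage of the baseline."""
    return (value - baseline) / baseline * 100.0


nvenc_total = 10913  # assumed to be the NVEnc (Blast Extreme GPU) total
pcoip_total = 10570  # assumed to be the PCoIP total

diff_pct = relative_difference_pct(nvenc_total, pcoip_total)
print(f"NVEnc total is {diff_pct:.1f}% above the PCoIP total")
```

Under that assumption the two totals differ by roughly 3%.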
AUTOCAD WORKLOAD 32 VM GPU UTILIZATION
[Chart: utilization (%, 0 to 100) over time from 19:54:47 to 21:33:50; series: GPU utilization, GPU memory utilization]
BLAST EXTREME (GPU) AVERAGE FPS (UX)
[Chart: AutoCAD average FPS, M60-1Q, 32 VMs, Blast Extreme (GPU) vs PCoIP; higher is better; bar values 36.81 and 36.49 for the bars labeled NvEnc (build3) and PCoIP; a reference line marks the minimum FPS for UX]
- The host does not saturate its CPU at 100% with 32 VMs running, so we can scale beyond 32 VMs. Further testing is planned.
- The GPU is not the bottleneck for scaling.
VMWARE TEST-BED FOR NVIDIA GRID ON HORIZON VIEW
- Virtual client VMs: 64-bit Win7 (SP1), 1 vCPU, 2 GB RAM, View Client 4.0
- Remote display protocol: Blast Extreme / PCoIP
- Virtual VDI desktop VMs: 64-bit Win7 (SP1), 2 vCPU, 4 GB RAM, 40 GB HD, Horizon View 7.0 agent
- Storage
- Dell R730 (listed twice with identical specifications): Intel Haswell CPUs (E5-2680 v3), 24 cores (2 x 12-core socket), 384 GB RAM, 2 x NVIDIA GRID M60
REMOTE DISPLAY PROTOCOLS IN HORIZON
- Blast Extreme: VMware's remote display protocol
  - Based on the H.264 standard
  - Exploits NVIDIA GPU capabilities for encoding
  - Clients can use any GPU or CPU for decoding
- Blast Extreme (GPU), referred to as BlastGPU: uses GPU assist for H.264 encoding (NVIDIA Tesla M60, virtual GRID in the enterprise cloud)
- Blast Extreme (CPU), referred to as BlastCPU: does not use hardware GPU assist for H.264 encoding
- PCoIP and Microsoft RDP
KNOWLEDGE WORKER APPS
Knowledge worker applications in ViewPlanner 3.6:
- Office apps: Word, Excel, PowerPoint, Outlook
- Adobe Acrobat Reader, Firefox, 7zip
- Windows Media Player
VIEWPLANNER QOS METHODOLOGY
Operations are split into groups:
- Group A: interactive, fast-running, CPU-bound operations. The user expects minimal latencies. E.g. modifying Word and Excel documents.
- Group B: long-running, slow, I/O-bound operations. The user can tolerate longer latencies. E.g. saving PowerPoint, Zip/UnZip.
QoS criteria:
- Group A: 95th percentile: 0.70 s (<= 1.0 s)
- Group B: 95th percentile: 2.3 s (<= 6.0 s)
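A pass/fail check against these QoS limits might look like the sketch below. It is not ViewPlanner's code: the function names are hypothetical, the nearest-rank percentile method is an assumption (the slides do not say how the 95th percentile is computed), and the latencies are made-up values.

```python
# QoS limits in seconds for the 95th-percentile latency of each group.
QOS_LIMITS_S = {"A": 1.0, "B": 6.0}


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile (integer pct, 1-100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def passes_qos(group, latencies_s):
    """Check whether a group's 95th-percentile latency is within its QoS limit."""
    return percentile_nearest_rank(latencies_s, 95) <= QOS_LIMITS_S[group]


# Hypothetical Group A latencies in seconds (not measured data).
group_a = [0.4] * 18 + [0.7, 1.5]
print(percentile_nearest_rank(group_a, 95), passes_qos("A", group_a))
```

In this example a single 1.5 s outlier out of 20 operations does not fail the run, because only the 95th percentile is compared with the limit.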
VIEWPLANNER MEASUREMENTS ON REMOTE CLIENTS
- Measures the true remote user experience: measurements are done on the remote clients.
- Latency measurement: each operation's start time and end time are noted on the remote client, as the remote client sees it.
- Frames/second metric for the video workload: frames seen by the remote client are counted.
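The two client-side metrics reduce to simple arithmetic on what the client observes. The sketch below illustrates the idea only; it is not how ViewPlanner instruments the client, and the function names and the stand-in operation are hypothetical.

```python
import time


def measure_latency(operation):
    """Time one operation from start to end as observed by the caller, in seconds."""
    start = time.perf_counter()
    operation()
    end = time.perf_counter()
    return end - start


def frames_per_second(frames_seen, duration_s):
    """Convert a count of frames seen on the client into frames per second."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return frames_seen / duration_s


# Stand-in operation; a real test would trigger an application action instead.
latency_s = measure_latency(lambda: sum(range(100_000)))
print(f"latency: {latency_s:.6f} s")

# Hypothetical count: 3600 frames seen during a 2-minute video.
print(frames_per_second(3600, 120.0))
```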
KNOWLEDGE WORKER WORKLOAD: GROUP A LATENCIES
[Chart: latency in seconds (left axis, 0.00 to 1.20) and latency normalized w.r.t. PCoIP (right axis, 0.00 to 1.20) vs. number of VMs (1, 8, 16, 32, 48, 64); lower is better; series: BlastGPU, BlastCPU, PCoIP, BlastGPU/PCoIP, BlastCPU/PCoIP]
KNOWLEDGE WORKER WORKLOAD: GROUP B LATENCIES
[Chart: latency in seconds (left axis, 0.00 to 3.50) and latency normalized w.r.t. PCoIP (right axis, 0.00 to 1.20) vs. number of VMs (1, 8, 16, 32, 48, 64); lower is better; series: BlastGPU, BlastCPU, PCoIP, BlastGPU/PCoIP, BlastCPU/PCoIP]
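The BlastGPU/PCoIP and BlastCPU/PCoIP series in the latency charts are ratios against the PCoIP result at the same VM count. A minimal sketch of that normalization follows; the function name is hypothetical and the latency values are invented for illustration, not read from the charts.

```python
def normalize_to_baseline(series, baseline):
    """Divide each value by the baseline value measured at the same VM count."""
    return {vm_count: series[vm_count] / baseline[vm_count] for vm_count in series}


# Hypothetical 95th-percentile latencies in seconds, keyed by number of VMs.
pcoip = {1: 0.5, 8: 0.5, 64: 1.0}
blast_gpu = {1: 0.25, 8: 0.5, 64: 0.75}

print(normalize_to_baseline(blast_gpu, pcoip))
```

For a latency metric, a ratio below 1.0 means the protocol responded faster than PCoIP at that scale.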
HEAVY VIDEO WORKLOAD
NVIDIA GPU SPECIFICATIONS
NVIDIA GPU: Tesla M60
- H.264 1080p30 streams: 36
- CUDA cores: 4096/GPU (2 x 2048)
- Concurrent users/GPU: 2-32
VMware testbed configuration:
- vGPU type: GRID M60-0Q
- GPUs/board: 2
- Number of boards: 2
HEAVY VIDEO WORKLOAD
- Video: 720p, 2-minute duration, 10 iterations
- Scaling: 8 VMs to 48 VMs
- Performance metrics: frames/second, CPU utilization
- GPU: decodes video streams, encodes the Blast Extreme protocol
VIDEO WORKLOAD: CUMULATIVE FRAMES/SECOND
[Chart: cumulative FPS (left axis, 0 to 800) and FPS normalized w.r.t. PCoIP (right axis, 0.00 to 4.50) vs. number of VMs (8, 16, 32, 48); higher is better; series: BlastGPU, BlastCPU, PCoIP, BlastGPU/PCoIP, BlastCPU/PCoIP, linear trend of BlastGPU/PCoIP]
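A sketch of the cumulative metric, assuming "cumulative FPS" means the sum of the frame rates seen by all clients at a given VM count (the slides do not define it explicitly). The function names are hypothetical and the per-VM values are invented, not read from the chart.

```python
def cumulative_fps(per_vm_fps):
    """Sum the frame rates observed by every client at one scale point."""
    return sum(per_vm_fps)


def average_fps(per_vm_fps):
    """Average frame rate per VM at one scale point."""
    return cumulative_fps(per_vm_fps) / len(per_vm_fps)


# Hypothetical per-VM frame rates at two scale points (not measured data).
runs = {8: [24.0] * 8, 16: [20.0] * 16}

for vm_count, fps_values in runs.items():
    print(vm_count, cumulative_fps(fps_values), average_fps(fps_values))
```

Looking at both numbers together is useful because cumulative FPS can keep rising with VM count even while the frame rate each user sees falls.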
VIDEO WORKLOAD: AVERAGE CPU UTILIZATION
[Chart: CPU utilization in % (left axis, 0 to 120) and average CPU utilization normalized w.r.t. PCoIP (right axis, 0.00 to 3.00) vs. number of VMs (8, 16, 32, 48); lower is better; series: BlastGPU, BlastCPU, PCoIP, BlastGPU/PCoIP, BlastCPU/PCoIP, linear trend of BlastGPU/PCoIP]
BLAST EXTREME WITH NVIDIA GPUS: TAKEAWAYS
- Better user experience
  - More frames/second
  - Lower latencies: better response times
- Lower CPU utilization
- Better scalability
RELATED SESSIONS
- TUTORIAL S6595 - Benchmarking Graphics Intensive Application on VMware Horizon 6 Using NVIDIA GRID vGPUs, by Manvender Rawat and Lan Vu
- S6198 - The Latest in High Performance Desktops with VMware Horizon and NVIDIA GRID vGPU, by Pat Lee and Luke Wignall
THANK YOU
Join the NVIDIA Developer Program at developer.nvidia.com/join
PHOTOSHOP OPENGL WORKLOAD
NVIDIA BLAST EXTREME ACCELERATION
[Diagram labels: Apps, Graphics commands, GRID GPU, 3D, Render Target, Front Buffer, Framebuffer, Context/Display Capture, HW Encoder, H.264 / H.265 streams, Remote Client]
- Reduces overall latency
- Offloads CPU workload to GPU
- Increases scalability
- Improves user experience
- Lowers network bandwidth demand