Distributing Computation to Large GPU Clusters

1 Distributing Computation to Large GPU Clusters

3 What is this about? DiCE: a software library for writing applications that scale to many GPUs and CPUs in a cluster. Used since 2003 in our rendering products: NVIDIA Iray, NVIDIA IndeX.

4 Why are we presenting this here? DiCE is a base technology in IndeX. Clustering / networking / distribution in IndeX are based on DiCE. The DiCE API is exposed by IndeX: distribute pre-computation of data for IndeX; do your own calculations. Nothing in DiCE is specific to rendering.

5 Design Goals: Provide a software library that domain experts can use to write scalable software for GPU clusters. Not required: low-level parallelization / networking knowledge. Not specific to a special domain (e.g. rendering). Easy to use. High performance, meant for interactive applications. Other solutions: OpenMP, MPI, UPC, ...

6 Unique Combination of Features: simple programming model; ease of deployment / commodity hardware; unified multi-core and cluster parallelization; CUDA support; dynamic clustering; focus on interactive applications; multi-user support; genuine distributed system: all hosts are equal.

7 Overview (architecture diagram): Application, C++ API, Job System, Datastore, Networking / Clustering

12 DiCE and IndeX (architecture diagram): Application, IndeX, C++ API, Job System, Datastore, Networking / Clustering

13 Networking / Clustering (section divider; overview diagram repeated)

14 Networking / Clustering: Handles cluster building and data transfers. Self-organizing, with dynamic addition and removal of hosts. Tested with up to 1000 hosts. Several networking protocols for different environments. Provides to the application: a list of hosts in the cluster (the same on all hosts!); notifications for new / leaving hosts; a list of resources in the cluster (GPUs, CPUs).

15 Network Layer: UDP with Multicast. Self-organization: a multicast address identifies the cluster; multicast beacon packets are used to detect other hosts; an election process elects one synchronizer; the synchronizer organizes the hosts. Multicast / unicast is used for bulk data transfers; especially effective for many hosts. One layer of sub-clustering.
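The slide does not say how the synchronizer is elected. As a rough illustration only, the sketch below assumes the simplest possible rule: every host collects the ids it has heard in beacons and treats the lowest id as the synchronizer. `Cluster_view`, `Host_id` and the lowest-id rule are hypothetical and are not part of the DiCE API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical host identifier; the slides do not say how hosts are named.
using Host_id = std::uint32_t;

// Local view of the cluster as one host sees it (illustrative only).
class Cluster_view {
public:
    explicit Cluster_view(Host_id self) : m_self(self) { m_hosts.insert(self); }

    // Called when a beacon packet from another host is received.
    void on_beacon(Host_id sender) { m_hosts.insert(sender); }

    // Called when a host is considered gone (e.g. its beacons stopped).
    void on_host_left(Host_id host) {
        if (host != m_self) m_hosts.erase(host);
    }

    // Assumed election rule: the lowest known id wins. If all hosts have
    // heard the same beacons, all of them pick the same synchronizer.
    Host_id synchronizer() const {
        assert(!m_hosts.empty());
        return *m_hosts.begin();
    }

    bool is_synchronizer() const { return synchronizer() == m_self; }
    std::size_t size() const { return m_hosts.size(); }

private:
    Host_id m_self;
    std::set<Host_id> m_hosts;  // ordered, so begin() is the lowest id
};
```

With this rule, a host that is alone elects itself, hands over as soon as a lower id appears, and takes over again when that host leaves.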

16 Network Layer: TCP with Multicast Discovery. For networks with low-bandwidth multicast. The UDP multicast layer is used for discovering hosts; TCP is used for all data transport.

17 Network Layer: TCP with Host List. For networks that do not support multicast (e.g. AWS). A host list is used for building the network; it does not have to be complete, but needs at least one host. Still self-organizing and dynamic. TCP is used for all data transport.

18 Network Layer: InfiniBand. Native InfiniBand with remote DMA (RDMA). Not a standalone network layer: the IP-based layers are used for clustering, and most communication goes over the IP layers. RDMA is used for speeding up bulk data transfer. Fastest transmissions: > 30 Gbit/s end-to-end.

19 Network Layer: Not exposed to the application! Rely on the Datastore and the Job System.

20 Datastore (section divider; overview diagram repeated)

21 Datastore: In-memory NoSQL datastore for arbitrary C++ objects. Store an object on some host, retrieve it on any host. Objects are identified by a numeric id or a string. Multi-version capability for multi-user operation. Data transport is transparent to the application.

22 Datastore Objects. Your class: class My_adder { float m_a; int m_b; float sum() { return m_a + m_b; } };

23 Datastore Objects. Arbitrary member variables: float m_a; int m_b;

24 Datastore Objects. Arbitrary member functions: float sum() { return m_a + m_b; }

25 Datastore Objects. Derive from base class: class My_adder : public Element< UUID > { float m_a; int m_b; float sum() { return m_a + m_b; } };

26 Datastore Objects. Implement serialization: void serialize(iserializer* serializer) { serializer->write(m_a); serializer->write(m_b); }

27 Datastore Objects. Implement deserialization: void deserialize(ideserializer* deserializer) { deserializer->read(m_a); deserializer->read(m_b); }

28 Datastore Objects. Register class: class My_adder : public Element< UUID > { float m_a; int m_b; void serialize(iserializer* serializer); void deserialize(ideserializer* deserializer); }; register_serializable_class< My_adder >();
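The serialize / deserialize pair from the slides can be tried out without DiCE. The following is a minimal sketch, assuming byte-buffer stand-ins for the serializer interfaces: `Byte_serializer`, `Byte_deserializer` and `round_trip` are hypothetical helpers, and the `Element< UUID >` base class and class registration are left out because their definitions are not shown in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the serializer interface: appends raw bytes.
class Byte_serializer {
public:
    template <typename T>
    void write(const T& value) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "only plain values can be copied byte-wise");
        const unsigned char* p = reinterpret_cast<const unsigned char*>(&value);
        m_bytes.insert(m_bytes.end(), p, p + sizeof(T));
    }
    const std::vector<unsigned char>& bytes() const { return m_bytes; }

private:
    std::vector<unsigned char> m_bytes;
};

// Hypothetical stand-in for the deserializer interface: reads bytes back
// in the same order in which they were written.
class Byte_deserializer {
public:
    explicit Byte_deserializer(const std::vector<unsigned char>& bytes)
        : m_bytes(bytes), m_pos(0) {}

    template <typename T>
    void read(T& value) {
        assert(m_pos + sizeof(T) <= m_bytes.size());
        std::memcpy(&value, m_bytes.data() + m_pos, sizeof(T));
        m_pos += sizeof(T);
    }

private:
    std::vector<unsigned char> m_bytes;
    std::size_t m_pos;
};

// The class from the slides, without the DiCE base class.
class My_adder {
public:
    My_adder() : m_a(0.0f), m_b(0) {}
    My_adder(float a, int b) : m_a(a), m_b(b) {}

    float sum() const { return m_a + m_b; }

    void serialize(Byte_serializer* serializer) const {
        serializer->write(m_a);
        serializer->write(m_b);
    }
    void deserialize(Byte_deserializer* deserializer) {
        deserializer->read(m_a);  // must mirror the order used in serialize()
        deserializer->read(m_b);
    }

private:
    float m_a;
    int m_b;
};

// Simulates moving an object to another host: serialize, then rebuild.
inline My_adder round_trip(const My_adder& original) {
    Byte_serializer serializer;
    original.serialize(&serializer);
    Byte_deserializer deserializer(serializer.bytes());
    My_adder copy;
    copy.deserialize(&deserializer);
    return copy;
}
```

Calling `round_trip(My_adder(1.5f, 2)).sum()` rebuilds the object from its bytes before calling `sum()`, which is roughly what happens when an object stored on one host is accessed on another.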

29 Datastore: Accessing an object makes sure it is available! Per-host cache for objects: store more data in the cluster than a single host could hold; configurable maximum cache size. Redundant storage for handling host failure: configurable redundancy level; automatic rebalancing in case of failure.

35 Datastore Transactions: Important for multi-user operation. ACID. A: transaction abort, commit, automatic abort in case of failure. C: cluster-wide locks available. I: starting a transaction freezes the view on the datastore. D: redundancy.

36-40 Transaction Isolation: Isolation is based on the multi-version capability. Copy-on-write. (Diagram build: transactions T11 and T7 with object versions A 5, X 9 and X 10; the final step shows T11 with A 5 and X 10.)
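How a frozen view can be built on top of multiple versions is sketched below. This is a generic multi-version scheme, not DiCE's actual implementation: every write adds a new version instead of overwriting the old one, and a transaction reads the newest version that existed when it started. `Versioned_store` and its member names are hypothetical, and the stored values are arbitrary.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Generic multi-version store (illustrative; not the DiCE datastore).
class Versioned_store {
public:
    using Version = std::uint64_t;

    // Copy-on-write: a write never modifies an existing version, it adds a
    // new one. Older versions stay readable for running transactions.
    Version commit(const std::string& key, int value) {
        ++m_latest;
        m_versions[key][m_latest] = value;
        return m_latest;
    }

    // Starting a transaction freezes its view at the current version.
    Version begin_transaction() const { return m_latest; }

    // Reads the newest version of key that is not newer than the snapshot.
    // Returns false if the key did not exist at that point.
    bool read(const std::string& key, Version snapshot, int& value) const {
        const auto object = m_versions.find(key);
        if (object == m_versions.end()) return false;
        const auto next = object->second.upper_bound(snapshot);
        if (next == object->second.begin()) return false;
        value = std::prev(next)->second;
        return true;
    }

private:
    Version m_latest = 0;
    std::map<std::string, std::map<Version, int>> m_versions;
};
```

A transaction that started before a later `commit("X", ...)` keeps reading the older value of X, while a transaction started afterwards sees the new one. A real system would also discard versions that no running transaction can still see; that step is omitted here.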

41 Job System (section divider; overview diagram repeated)

42 Parallelization Model. Programmer: split the work into n fragments! As independent as possible. Small enough, yet still efficient. Potentially thousands per frame! No a priori knowledge about the resources in the cluster! Data transport goes through the datastore. Goal: distribute work over all GPUs / CPUs in the cluster.

43 Parallelization Model: Fragmented Job, similar to a CUDA kernel. Implement a C++ class with at least one function: void execute_fragment(int i, int n) { }. Ask DiCE to execute the job in n fragments. DiCE calls execute_fragment() once for every fragment (i = 0 ... n-1). DiCE assigns a CPU core and/or GPU exclusively to each fragment. The job decides whether it needs a GPU. Job execution has access to all members and member functions.
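The fragmented-job contract can be imitated on a single machine. In the sketch below only the `execute_fragment(int i, int n)` signature is taken from the slide; `Sum_job` and `execute_fragmented` are hypothetical, and the scheduler is a deliberately naive stand-in that starts one thread per fragment, whereas DiCE assigns CPU cores and GPUs itself.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical job: sums a vector of integers in n independent fragments.
class Sum_job {
public:
    Sum_job(std::vector<int> data, int n)
        : m_data(std::move(data)), m_partial(static_cast<std::size_t>(n), 0) {}

    // Called once per fragment index i in [0, n). Each call reads the shared
    // input but writes only its own slot, so fragments need no locking.
    void execute_fragment(int i, int n) {
        assert(n > 0 && i >= 0 && i < n);
        assert(m_partial.size() == static_cast<std::size_t>(n));
        const std::size_t size = m_data.size();
        const std::size_t begin = size * static_cast<std::size_t>(i) / static_cast<std::size_t>(n);
        const std::size_t end = size * static_cast<std::size_t>(i + 1) / static_cast<std::size_t>(n);
        long long sum = 0;
        for (std::size_t k = begin; k < end; ++k) sum += m_data[k];
        m_partial[static_cast<std::size_t>(i)] = sum;
    }

    // Job members stay accessible after execution, as on the slide.
    long long total() const {
        long long sum = 0;
        for (long long partial : m_partial) sum += partial;
        return sum;
    }

private:
    std::vector<int> m_data;
    std::vector<long long> m_partial;
};

// Naive stand-in for the job system: one thread per fragment.
template <typename Job>
void execute_fragmented(Job& job, int n) {
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&job, i, n] { job.execute_fragment(i, n); });
    }
    for (std::thread& worker : workers) worker.join();
}
```

Because every fragment writes to its own slot, the fragments can run in any order and on any worker, which is the independence the slides ask the programmer to provide.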

44 Parallelization Model - Cluster: Not a shared memory model! Idea: split execution and integration of results. Remote host: void execute_remote(int i, int n, OUT) { }. Origin host: void receive_result(int i, int n, IN) { }. execute_remote() + receive_result() = execute_fragment().
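The split into execute_remote() and receive_result() can be sketched as follows. The slide leaves the OUT and IN argument types open, so this example assumes a plain byte buffer (`Buffer`); `Remote_sum_job` and `run_simulated_cluster` are hypothetical, and the "remote host" is simulated by running the remote half on a copy of the job within the same process.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Assumed stand-in for the OUT / IN arguments on the slide.
using Buffer = std::vector<unsigned char>;

class Remote_sum_job {
public:
    explicit Remote_sum_job(std::vector<int> data)
        : m_data(std::move(data)), m_total(0) {}

    // Remote host: compute the partial result and write it to the buffer.
    // It must not touch state that lives only on the origin host.
    void execute_remote(int i, int n, Buffer& out) const {
        assert(n > 0 && i >= 0 && i < n);
        const std::size_t size = m_data.size();
        const std::size_t begin = size * static_cast<std::size_t>(i) / static_cast<std::size_t>(n);
        const std::size_t end = size * static_cast<std::size_t>(i + 1) / static_cast<std::size_t>(n);
        long long partial = 0;
        for (std::size_t k = begin; k < end; ++k) partial += m_data[k];
        out.resize(sizeof(partial));
        std::memcpy(out.data(), &partial, sizeof(partial));
    }

    // Origin host: integrate the transported result into the job's state.
    void receive_result(int /*i*/, int /*n*/, const Buffer& in) {
        assert(in.size() == sizeof(long long));
        long long partial = 0;
        std::memcpy(&partial, in.data(), sizeof(partial));
        m_total += partial;
    }

    // Local execution is the composition of the two halves.
    void execute_fragment(int i, int n) {
        Buffer result;
        execute_remote(i, n, result);
        receive_result(i, n, result);
    }

    long long total() const { return m_total; }

private:
    std::vector<int> m_data;
    long long m_total;  // only meaningful on the origin host
};

// Simulated cluster run: even fragments run locally on the origin job, odd
// fragments run on a copy of the job and only the result buffer comes back.
inline void run_simulated_cluster(Remote_sum_job& origin, int n) {
    const Remote_sum_job remote_copy = origin;
    for (int i = 0; i < n; ++i) {
        if (i % 2 == 0) {
            origin.execute_fragment(i, n);
        } else {
            Buffer wire;
            remote_copy.execute_remote(i, n, wire);
            origin.receive_result(i, n, wire);
        }
    }
}
```

The design point is that only the result buffer crosses the host boundary; the remote copy never writes into the origin's accumulated state, which matches the "not a shared memory model" remark.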

45-49 Parallelization Model, Single Host (diagram build): My_job with members Scene, Camera and Framebuf[ ] is split into fragments 0 to 5. Fragments 0, 1 and 4 are assigned to GPU 1; fragments 2, 3 and 5 are assigned to GPU 2. Execute fragment is called for each of the fragments 0 to 5.

50 Parallelization Model

51-56 Parallelization Model, 3 Hosts (diagram build): Host 1 holds My_job with Scene, Camera and Framebuf[ ]; Host 2 and Host 3 each hold a copy of My_job with Scene and Camera. Fragment assignment: 0 to GPU 1 on Host 1, 1 to GPU 1 on Host 2, 2 to GPU 2 on Host 2, 3 to GPU 2 on Host 1, 4 to GPU 1 on Host 3, 5 to GPU 2 on Host 3. Host 1 runs execute fragment for fragments 0 and 3. The remote hosts run execute remote for fragments 1, 2, 4 and 5, and Host 1 runs receive result for fragments 1, 2, 4 and 5.

57 Parallelization Model - Hierarchical (diagram): Viewer Host: Compositor Job. Compositor Host: Compositor Fragment, Rendering Job. Render Host: Rendering Fragment, GPU Job. GPUs: GPU Fragment.

58 Other Features: more multi-user capabilities (scopes); futures; global logging system; HTTP server; RTMP video streaming; Cloud Bridge; ...

59 Summary: DiCE is a library for writing parallel applications. DiCE is used in our rendering products. Available to those using IndeX.

60 Questions?


More information

Assignment - 1 Chap. 1 Wired LAN s

Assignment - 1 Chap. 1 Wired LAN s Assignment - 1 Chap. 1 Wired LAN s 1. (1 Mark) 1. Draw the frame format of Ethernet. 2. What is unicast, multicast and broadcast address? 3. State the purpose of CRC field. 2. (5 Marks) 1. Explain how

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:

More information
