Distributing Computation to Large GPU Clusters
- Stanley Oliver
1 Distributing Computation to Large GPU Clusters
3 What is this about?
- DiCE: software library for writing applications that scale to many GPUs and CPUs in a cluster
- Used since 2003 in our rendering products...
  - NVIDIA Iray
  - NVIDIA IndeX
4 Why are we presenting this here?
- DiCE is a base technology in IndeX
  - Clustering / networking / distribution based on DiCE
- DiCE API exposed by IndeX
  - Distribute pre-computation of data for IndeX
  - Do your own calculations
- Nothing in DiCE is specific to rendering
5 Design Goals
- Provide a software library which domain experts can use to write scalable software for GPU clusters
- Not required: low-level parallelization / networking knowledge
- Not specific to a particular domain (e.g. rendering)
- Easy to use...
- High performance, meant for interactive applications
- Other solutions: OpenMP, MPI, UPC,...
6 Unique Combination of Features
- Simple programming model
- Ease of deployment / commodity hardware
- Unified multi-core and cluster parallelization
- CUDA support
- Dynamic clustering
- Focus on interactive applications
- Multi-user support
- Genuine distributed system: all hosts are equal
7 Overview
[Figure: architecture layers: Application; C++ API; Job System; Datastore; Networking / Clustering]
12 DiCE and IndeX
[Figure: architecture layers: Application; IndeX; C++ API; Job System; Datastore; Networking / Clustering]
13 Networking / Clustering
14 Networking / Clustering
- Handles cluster building and data transfers
- Self-organizing, dynamic addition and removal of hosts
- Tested with up to 1000 hosts
- Several networking protocols for different environments
- Provides to the application:
  - List of hosts in the cluster; same on all hosts!
  - Notification for new / leaving hosts
  - List of resources in the cluster (GPUs, CPUs)
15 Network Layer: UDP with Multicast
- Self-organization:
  - Multicast address identifies the cluster
  - Multicast beacon packets to detect other hosts
  - Election process to elect one synchronizer
  - Synchronizer organizes hosts
- Multicast / unicast used for bulk data transfers
  - Especially effective for many hosts
  - One layer of sub-clustering
16 Network Layer: TCP with Multicast Discovery
- For networks with low-bandwidth multicast
- UDP multicast layer used for discovering hosts
- TCP used for all data transport
17 Network Layer: TCP with Host List
- For networks which do not support multicast (e.g. AWS)
- Host list used for building the network
  - Does not have to be complete
  - At least one host
- Still self-organizing and dynamic
- TCP used for all data transport
18 Network Layer: InfiniBand
- Native InfiniBand with remote DMA (RDMA)
- Not a standalone network layer
  - IP-based layers used for clustering
  - Most communication over IP layers
  - RDMA used for speeding up bulk data transfer
- Fastest transmissions > 30 Gbit/s end-to-end
19 Network Layer
- Not exposed to the application!
- Rely on the Datastore and Job System!
20 Datastore
21 Datastore
- In-memory NoSQL datastore for arbitrary C++ objects
- Store an object on some host / retrieve it on any host
- A numeric ID / string identifies each object
- Multi-version capability for multi-user operation
- Data transport is transparent to the application
22-24 Datastore Objects
Your class:
    class My_adder {
        float m_a;                          // arbitrary member variables
        int m_b;
        float sum() { return m_a + m_b; }   // arbitrary member functions
    };
25 Datastore Objects
Derive from base class:
    class My_adder : public Element< UUID > {
        float m_a;
        int m_b;
        float sum() { return m_a + m_b; }
    };
26 Datastore Objects
Implement serialization:
    class My_adder : public Element< UUID > {
        float m_a;
        int m_b;
        void serialize(iserializer* serializer) {
            serializer->write(m_a);
            serializer->write(m_b);
        }
    };
27 Datastore Objects
Implement deserialization:
    class My_adder : public Element< UUID > {
        float m_a;
        int m_b;
        void serialize(iserializer* serializer);
        void deserialize(ideserializer* deserializer) {
            deserializer->read(m_a);
            deserializer->read(m_b);
        }
    };
28 Datastore Objects
Register class:
    class My_adder : public Element< UUID > {
        float m_a;
        int m_b;
        void serialize(iserializer* serializer);
        void deserialize(ideserializer* deserializer);
    };
    register_serializable_class< My_adder >();
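The slides show only the member functions of `My_adder`, not the serializer interfaces behind them. The following is a minimal, self-contained sketch of the round trip such an object might make when it moves between hosts. `Byte_serializer`, `Byte_deserializer` and `round_trip()` are hypothetical stand-ins written for this illustration; they are not the DiCE `iserializer` / `ideserializer` interfaces, and the `Element< UUID >` base class and class registration are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the serializer on the slides (not the DiCE API):
// appends the raw bytes of trivially copyable values to a buffer.
class Byte_serializer {
public:
    template <typename T>
    void write(const T& value) {
        static_assert(std::is_trivially_copyable<T>::value, "raw byte copy only");
        const unsigned char* p = reinterpret_cast<const unsigned char*>(&value);
        m_buffer.insert(m_buffer.end(), p, p + sizeof(T));
    }
    const std::vector<unsigned char>& buffer() const { return m_buffer; }
private:
    std::vector<unsigned char> m_buffer;
};

// Hypothetical stand-in for the deserializer: reads values back in the
// same order in which they were written.
class Byte_deserializer {
public:
    explicit Byte_deserializer(const std::vector<unsigned char>& buffer)
        : m_buffer(buffer), m_pos(0) {}
    template <typename T>
    void read(T& value) {
        static_assert(std::is_trivially_copyable<T>::value, "raw byte copy only");
        assert(m_pos + sizeof(T) <= m_buffer.size());
        std::memcpy(&value, m_buffer.data() + m_pos, sizeof(T));
        m_pos += sizeof(T);
    }
private:
    std::vector<unsigned char> m_buffer;
    std::size_t m_pos;
};

// The class from the slides, without the Element< UUID > base class.
class My_adder {
public:
    My_adder(float a = 0.0f, int b = 0) : m_a(a), m_b(b) {}
    float sum() const { return m_a + m_b; }
    void serialize(Byte_serializer* serializer) const {
        serializer->write(m_a);
        serializer->write(m_b);
    }
    void deserialize(Byte_deserializer* deserializer) {
        deserializer->read(m_a);   // must mirror the order used in serialize()
        deserializer->read(m_b);
    }
private:
    float m_a;
    int m_b;
};

// Simulates a transfer to another host: serialize, then rebuild from bytes.
My_adder round_trip(const My_adder& original) {
    Byte_serializer serializer;
    original.serialize(&serializer);
    Byte_deserializer deserializer(serializer.buffer());
    My_adder copy;
    copy.deserialize(&deserializer);
    return copy;
}
```

The point of the sketch is the symmetry: `deserialize()` has to read the members in exactly the order `serialize()` wrote them, which is why the two functions on the slides mirror each other.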
29 Datastore
- Accessing an object will make sure it is available!
- Per-host cache for objects
  - Store more data in the cluster than a single host could
  - Configurable max cache size
- Redundant storage for handling host failure
  - Configurable redundancy level
  - Automatic rebalancing in case of failure
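The slides do not say how the per-host cache decides what to drop once the configured maximum is reached. As one possible illustration, the sketch below assumes a least-recently-used policy and a limit counted in entries rather than bytes; both are assumptions, and `Host_cache` is a hypothetical name. A miss returns `nullptr`, which is the point where a real datastore would fetch the object from another host.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical per-host object cache with a configurable maximum size.
// Assumption: least-recently-used eviction, size counted in entries.
class Host_cache {
public:
    explicit Host_cache(std::size_t max_entries) : m_max_entries(max_entries) {
        assert(max_entries > 0);
    }

    // Inserts or updates an object and marks it as most recently used.
    void put(int id, const std::string& data) {
        auto it = m_index.find(id);
        if (it != m_index.end()) {
            it->second->second = data;
            m_entries.splice(m_entries.begin(), m_entries, it->second);
            return;
        }
        if (m_entries.size() == m_max_entries) {
            // Cache is full: drop the least recently used object.
            m_index.erase(m_entries.back().first);
            m_entries.pop_back();
        }
        m_entries.emplace_front(id, data);
        m_index[id] = m_entries.begin();
    }

    // Returns the cached object, or nullptr on a miss. On a miss a real
    // datastore would request the object from a host that stores it.
    const std::string* get(int id) {
        auto it = m_index.find(id);
        if (it == m_index.end()) return nullptr;
        m_entries.splice(m_entries.begin(), m_entries, it->second);
        return &it->second->second;
    }

    std::size_t size() const { return m_entries.size(); }

private:
    using Entry = std::pair<int, std::string>;
    std::size_t m_max_entries;
    std::list<Entry> m_entries;   // front = most recently used
    std::unordered_map<int, std::list<Entry>::iterator> m_index;
};
```

Because evicted objects still live elsewhere in the cluster, a bounded cache of this kind is what allows the cluster as a whole to hold more data than any single host.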
35 Datastore Transactions
- Important for multi-user operation
- ACID
  - A: Transaction abort, commit, automatic abort in case of failure
  - C: Cluster-wide locks available
  - I: Starting a transaction freezes its view of the datastore
  - D: Redundancy
36-40 Transaction Isolation
- Isolation based on multi-version capability
- Copy-on-write
[Figure sequence: transactions T7 and T11 with datastore entries A 5, X 9 and X 10; in the last frame only T11, A 5 and X 10 remain.]
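To make the combination of "starting a transaction freezes the view" and copy-on-write more concrete, here is a minimal sketch of a multi-version store. It is an illustration under simplifying assumptions, not the DiCE implementation: `Versioned_store` and its members are hypothetical names, every `store()` call commits immediately, values are plain integers, and old versions are never garbage-collected.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical multi-version store. A write never overwrites; it adds a new
// version tagged with a growing commit counter (copy-on-write).
class Versioned_store {
public:
    // A transaction remembers the commit counter at the time it started.
    struct Transaction { long snapshot; };

    // Starting a transaction freezes its view of the store.
    Transaction begin() const { return Transaction{m_commit_counter}; }

    // Simplification: each store() is committed immediately as a new version.
    void store(const std::string& key, int value) {
        m_versions[key][++m_commit_counter] = value;
    }

    // Reads the newest version that was committed before the transaction
    // started. Returns false if no such version exists.
    bool read(const Transaction& t, const std::string& key, int& value) const {
        auto versions = m_versions.find(key);
        if (versions == m_versions.end()) return false;
        auto newer = versions->second.upper_bound(t.snapshot);
        if (newer == versions->second.begin()) return false;
        --newer;
        value = newer->second;
        return true;
    }

private:
    long m_commit_counter = 0;
    std::map<std::string, std::map<long, int>> m_versions;  // key -> version -> value
};
```

A transaction that started before a later write keeps reading the older version, while a transaction started afterwards sees the new one. That is the property that lets several users work on the same datastore without seeing each other's half-finished changes.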
41 Job System
42 Parallelization Model
- Programmer: split the work into n fragments!
  - As independent as possible
  - Small enough, but still efficient
  - Potentially thousands per frame!
- No a priori knowledge about resources in the cluster!
- Data transport through the datastore
- Goal: distribute work over all GPUs / CPUs in the cluster
43 Parallelization Model
- Fragmented job ~ similar to a CUDA kernel
- Implement a C++ class with at least one function:
    void execute_fragment(int i, int n) { }
- Ask DiCE to execute the job in n fragments
- DiCE calls execute_fragment() once for every fragment (i = 0 ... n-1)
- DiCE assigns a CPU core and/or GPU exclusively to a fragment
- The job decides if it needs a GPU
- Job execution has access to all members and member functions
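The contract described on this slide can be sketched in a few lines. The code below is an illustration only: `Fragmented_job`, `execute_fragmented()` and `Sum_job` are hypothetical names, and the stand-in scheduler simply runs the fragments one after another on the calling thread (in reverse order, to stress that a job must not depend on fragment order). DiCE itself would hand the fragments to CPU cores and GPUs across the cluster.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical job interface modeled on the slide: one function that is
// called once per fragment with the fragment index i and the count n.
class Fragmented_job {
public:
    virtual ~Fragmented_job() {}
    virtual void execute_fragment(int i, int n) = 0;
};

// Stand-in scheduler (assumption): calls execute_fragment() exactly once for
// every i in 0 ... n-1. It runs sequentially and in reverse order; a real
// scheduler would distribute the fragments over the available resources.
void execute_fragmented(Fragmented_job& job, int n) {
    assert(n >= 0);
    for (int i = n - 1; i >= 0; --i)
        job.execute_fragment(i, n);
}

// Example job: fragment i sums its own slice of the input and writes the
// result to its own slot, so fragments never touch each other's data.
class Sum_job : public Fragmented_job {
public:
    Sum_job(const std::vector<int>& input, int n)
        : m_input(input), m_partial(static_cast<std::size_t>(n), 0) {
        assert(n > 0);
    }

    void execute_fragment(int i, int n) override {
        const std::size_t begin = m_input.size() * static_cast<std::size_t>(i) / n;
        const std::size_t end = m_input.size() * static_cast<std::size_t>(i + 1) / n;
        long sum = 0;
        for (std::size_t j = begin; j < end; ++j)
            sum += m_input[j];
        m_partial[static_cast<std::size_t>(i)] = sum;
    }

    // Combines the per-fragment results after all fragments have run.
    long total() const {
        long sum = 0;
        for (long partial : m_partial) sum += partial;
        return sum;
    }

private:
    std::vector<int> m_input;
    std::vector<long> m_partial;
};
```

As with a CUDA kernel, the job is written once against the pair (i, n) and carries no knowledge of how many cores or GPUs will eventually run it; giving each fragment its own output slot keeps the fragments independent.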
44 Parallelization Model - Cluster
- Not a shared memory model!
- Idea: split execution and integration of results
    void execute_remote(int i, int n, OUT) { }    // remote host
    void receive_result(int i, int n, IN) { }     // origin host
- execute_remote() + receive_result() = execute_fragment()
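The split into `execute_remote()` and `receive_result()` can be illustrated without any networking. In the sketch below, `Result_buffer` is a hypothetical stand-in for the `OUT` / `IN` parameters on the slide, `Remote_sum_job` is an example job invented for this illustration, and `run_on_simulated_cluster()` fakes a second host by copying the job object. None of these names come from DiCE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the OUT / IN parameters: the data that travels
// from the remote host back to the origin host.
using Result_buffer = std::vector<long>;

class Remote_sum_job {
public:
    explicit Remote_sum_job(const std::vector<int>& input)
        : m_input(input), m_total(0) {}

    // Runs on a remote host: computes fragment i and writes only the result
    // to the buffer. It must not rely on memory of the origin host.
    void execute_remote(int i, int n, Result_buffer& out) const {
        const std::size_t begin = m_input.size() * static_cast<std::size_t>(i) / n;
        const std::size_t end = m_input.size() * static_cast<std::size_t>(i + 1) / n;
        long sum = 0;
        for (std::size_t j = begin; j < end; ++j)
            sum += m_input[j];
        out.push_back(sum);
    }

    // Runs on the origin host: integrates a result that arrived from remote.
    void receive_result(int i, int n, const Result_buffer& in) {
        (void)i;
        (void)n;
        assert(in.size() == 1);
        m_total += in[0];
    }

    // Local execution is the composition of the two functions above.
    void execute_fragment(int i, int n) {
        Result_buffer buffer;
        execute_remote(i, n, buffer);
        receive_result(i, n, buffer);
    }

    long total() const { return m_total; }

private:
    std::vector<int> m_input;
    long m_total;
};

// Simulated cluster run (assumption): even fragments run on the origin host,
// odd fragments on a copy of the job that stands in for a remote host.
long run_on_simulated_cluster(Remote_sum_job& origin, int n) {
    assert(n > 0);
    const Remote_sum_job remote_copy = origin;  // job is shipped to the remote host
    for (int i = 0; i < n; ++i) {
        if (i % 2 == 0) {
            origin.execute_fragment(i, n);
        } else {
            Result_buffer wire;                      // what would cross the network
            remote_copy.execute_remote(i, n, wire);
            origin.receive_result(i, n, wire);
        }
    }
    return origin.total();
}
```

Since `execute_remote()` is `const` and only writes to the buffer, the remote copy of the job never needs to be sent back; only the buffer does. That is the practical consequence of "not a shared memory model".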
45-49 Parallelization Model: Single Host
[Figure sequence: My_job with Scene, Camera and Framebuf[ ]; "Execute fragment" runs for fragments 0 to 5.]
Fragment assignment:
- Fragment 0: GPU 1
- Fragment 1: GPU 1
- Fragment 2: GPU 2
- Fragment 3: GPU 2
- Fragment 4: GPU 1
- Fragment 5: GPU 2
50 Parallelization Model
51-55 Parallelization Model: 3 Hosts
[Figure sequence: My_job with Scene, Camera and Framebuf[ ] on Host 1; copies of My_job, Scene and Camera appear on Host 2 and Host 3.]
Fragment assignment:
- Fragment 0: GPU 1, Host 1
- Fragment 1: GPU 1, Host 2
- Fragment 2: GPU 2, Host 2
- Fragment 3: GPU 2, Host 1
- Fragment 4: GPU 1, Host 3
- Fragment 5: GPU 2, Host 3
Execution:
- Host 1: "Execute fragment" for fragments 0 and 3
- Host 2 and Host 3: "Execute remote" for fragments 1, 2, 4 and 5
- Host 1: "Receive result" for fragments 1, 2, 4 and 5
56 Parallelization Model 3 Hosts
57 Parallelization Model - Hierarchical
[Figure: jobs nested across host levels]
- Viewer Host: Compositor Job
- Compositor Host: Compositor Fragment, Rendering Job
- Render Host: Rendering Fragment, GPU Job
- GPUs: GPU Fragment
58 Other Features
- More multi-user capabilities (scopes)
- Futures
- Global logging system
- HTTP server
- RTMP video streaming
- Cloud Bridge...
59 Summary
- DiCE is a library for writing parallel applications
- DiCE is used in our rendering products
- Available to those using IndeX
60 Questions?
STO1193BU A Closer Look at vsan Networking Design and Configuration Considerations Cormac Hogan Andreas Scherr VMworld 2017 Content: Not for publication #VMworld #STO1193BU Disclaimer This presentation
More informationRealtime Signal Processing on Nvidia TX2 using CUDA
Realtime Signal Processing on Nvidia TX2 using CUDA Armin Weiss Dr. Amin Mazloumian Dr. Matthias Rosenthal Institute of Embedded Systems High Performance Multimedia Research Group Zurich University of
More informationDesign challenges of Highperformance. MPI over InfiniBand. Presented by Karthik
Design challenges of Highperformance and Scalable MPI over InfiniBand Presented by Karthik Presentation Overview In depth analysis of High-Performance and scalable MPI with Reduced Memory Usage Zero Copy
More informationFCUDA: Enabling Efficient Compilation of CUDA Kernels onto
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:
More informationThe Stampede is Coming: A New Petascale Resource for the Open Science Community
The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation
More informationCobalt Digital Inc Galen Drive Champaign, IL USA
Cobalt Digital White Paper IP Video Transport Protocols Knowing What To Use When and Why Cobalt Digital Inc. 2506 Galen Drive Champaign, IL 61821 USA 1-217-344-1243 www.cobaltdigital.com support@cobaltdigital.com
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationNVIDIA GPUDirect Technology. NVIDIA Corporation 2011
NVIDIA GPUDirect Technology NVIDIA GPUDirect : Eliminating CPU Overhead Accelerated Communication with Network and Storage Devices Peer-to-Peer Communication Between GPUs Direct access to CUDA memory for
More informationDesigning Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services. Presented by: Jitong Chen
Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services Presented by: Jitong Chen Outline Architecture of Web-based Data Center Three-Stage framework to benefit
More informationEnabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters
Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationEFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM
EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM Sreeram Potluri, Anshuman Goswami NVIDIA Manjunath Gorentla Venkata, Neena Imam - ORNL SCOPE OF THE WORK Reliance on CPU
More informationSOFTWARE-DEFINED BLOCK STORAGE FOR HYPERSCALE APPLICATIONS
SOFTWARE-DEFINED BLOCK STORAGE FOR HYPERSCALE APPLICATIONS SCALE-OUT SERVER SAN WITH DISTRIBUTED NVME, POWERED BY HIGH-PERFORMANCE NETWORK TECHNOLOGY INTRODUCTION The evolution in data-centric applications,
More informationHow to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries
How to Boost the Performance of Your MPI and PGAS s with MVAPICH2 Libraries A Tutorial at the MVAPICH User Group (MUG) Meeting 18 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationThe rcuda middleware and applications
The rcuda middleware and applications Will my application work with rcuda? rcuda currently provides binary compatibility with CUDA 5.0, virtualizing the entire Runtime API except for the graphics functions,
More informationMDHIM: A Parallel Key/Value Store Framework for HPC
MDHIM: A Parallel Key/Value Store Framework for HPC Hugh Greenberg 7/6/2015 LA-UR-15-25039 HPC Clusters Managed by a job scheduler (e.g., Slurm, Moab) Designed for running user jobs Difficult to run system
More informationDELIVERABLE D5.5 Report on ICARUS visualization cluster installation. John BIDDISCOMBE (CSCS) Jerome SOUMAGNE (CSCS)
DELIVERABLE D5.5 Report on ICARUS visualization cluster installation John BIDDISCOMBE (CSCS) Jerome SOUMAGNE (CSCS) 02 May 2011 NextMuSE 2 Next generation Multi-mechanics Simulation Environment Cluster
More informationWork Queue + Python. A Framework For Scalable Scientific Ensemble Applications
Work Queue + Python A Framework For Scalable Scientific Ensemble Applications Peter Bui, Dinesh Rajan, Badi Abdul-Wahid, Jesus Izaguirre, Douglas Thain University of Notre Dame Distributed Computing Examples
More informationCSE 120 Principles of Operating Systems
CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number
More information