Optimizing Network Performance in Distributed Machine Learning. Luo Mai, Chuntao Hong, Paolo Costa
2 Machine Learning. Successful in many fields: online advertisement, spam filtering, fraud detection, image recognition. One of the most important workloads in data centers.
3 Industry-Scale Machine Learning. More data, higher accuracy. Scale of industry problems: 100 billions of samples (1 TB to 1 PB of data); 10 billions of parameters (1 GB to 1 TB of data). Distributed execution: 100s to 1000s of machines.
4 Distributed Machine Learning. [Diagram labels: W1, W2, W3, W4; model replicas; data partitions; workers.]
5 Distributed Machine Learning. [Diagram labels: gradient; model replicas; data partitions; workers.]
6 Distributed Machine Learning. 1. Push gradients to the parameter server. 2. Aggregate the gradient for each parameter. [Diagram labels: parameter server; model replicas; data partitions; workers.]
7 Distributed Machine Learning. 3. Add gradients to parameters at the parameter server: W1 + g1, W2 + g2, W3 + g3, W4 + g4. 4. Pull new parameters. [Diagram labels: parameter server; model replicas; data partitions; workers.]
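The four steps on slides 6 and 7 can be sketched in a few lines of Python. This is a minimal in-process illustration, not the system described in the talk: the `ParameterServer` class and its method names are hypothetical, and the update follows the slide's "W + g" form literally (a real trainer would typically scale the gradient by a learning rate first).

```python
class ParameterServer:
    """Toy, single-process stand-in for a parameter server (hypothetical)."""

    def __init__(self, params):
        self.params = list(params)
        self.pending = []  # gradients pushed since the last update

    def push(self, gradient):
        # Step 1: a worker pushes its gradient.
        self.pending.append(list(gradient))

    def update(self):
        if not self.pending:
            return
        # Step 2: aggregate the gradient for each parameter.
        aggregated = [sum(column) for column in zip(*self.pending)]
        # Step 3: add the aggregated gradients to the parameters (W + g).
        self.params = [w + g for w, g in zip(self.params, aggregated)]
        self.pending = []

    def pull(self):
        # Step 4: a worker pulls a copy of the new parameters.
        return list(self.params)


ps = ParameterServer([1.0, 2.0, 3.0, 4.0])
ps.push([0.5, 0.0, -0.5, 0.25])   # gradient from worker A
ps.push([0.5, 0.25, -0.5, 0.0])   # gradient from worker B
ps.update()
new_params = ps.pull()
```

Every worker pushes to, and pulls from, the same server in this sketch, which is why a single server becomes the hot spot as the number of workers grows.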
8 Distributed Machine Learning. Parameter servers: use multiple PS to avoid a bottleneck. [Diagram labels: W1, W2, W3, W4; model replicas; data partitions; workers.]
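One common way to use multiple parameter servers is to give each server a disjoint slice of the parameters. The sketch below assumes a simple round-robin assignment by parameter index; the slides do not say which partitioning policy is used, and `shard_for` and `partition` are illustrative names only.

```python
def shard_for(param_index, num_servers):
    """Pick the server for a parameter by round-robin (an assumed policy)."""
    return param_index % num_servers


def partition(num_params, num_servers):
    """Return, for each server, the list of parameter indices it owns."""
    shards = [[] for _ in range(num_servers)]
    for index in range(num_params):
        shards[shard_for(index, num_servers)].append(index)
    return shards


shards = partition(num_params=8, num_servers=4)
largest_shard = max(len(shard) for shard in shards)
```

With the load spread this way, each server receives only the gradient entries for its own shard, so per-server traffic shrinks as servers are added.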
9 Distributed Machine Learning. Parameter servers: bottleneck. [Diagram labels: model replicas; data partitions; workers.]
10 Inbound Congestion. [Diagram labels: network core; inbound congestion.]
11 Outbound Congestion. [Diagram labels: network core; outbound congestion.]
12 Network Core Congestion. Congestion occurs in the core of over-subscribed networks. [Diagram label: over-subscribed network core.]
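A back-of-the-envelope calculation shows why an over-subscribed core congests. The sketch assumes the usual meaning of a 1:N over-subscription ratio (the hosts below a switch can together offer N times the switch's uplink capacity) and borrows the 10 Gbps link speed and 1:4 ratio that appear in the evaluation slides; the function name is hypothetical.

```python
def worst_case_cross_rack_gbps(host_link_gbps, oversubscription):
    """Per-host bandwidth through the core when every host sends across racks.

    Assumes a 1:N ratio means aggregate host capacity is N times the uplink
    capacity, so each host gets 1/N of its link speed in the worst case.
    """
    return host_link_gbps / oversubscription


full_bisection = worst_case_cross_rack_gbps(10, 1)   # non-oversubscribed
oversubscribed = worst_case_cross_rack_gbps(10, 4)   # 1:4 oversubscribed
```

Under these assumptions, a host with a 10 Gbps link sees only 2.5 Gbps toward other racks when all hosts push gradients at once.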
13 Existing Approaches. Over-provisioning the network with fast network H/W (e.g., InfiniBand and RoCE): expensive; limited deployment scale; not available in public clouds. [Stack diagram: training algorithm; network H/W.]
14 Existing Approaches. Over-provisioning the network: expensive; limited deployment scale; not available in public clouds. Asynchronous training algorithm: affects training efficiency; might not converge. [Stack diagram: asynchronous training algorithm; network H/W.]
15 Rethinking the Network Design. MLNet is a communication layer designed for distributed machine learning systems. It improves communication efficiency and is orthogonal to existing approaches. [Stack diagram: training algorithm; MLNet; network H/W.]
16 Rethinking the Network Design. MLNet is a communication layer designed for distributed machine learning systems. It improves communication efficiency and is orthogonal to existing approaches. Optimizations: traffic reduction; flow prioritization. [Stack diagram: training algorithm; MLNet; network H/W.]
17 Traffic Reduction
18 Traffic Reduction: Key Insight. Aggregate the gradients from 6 workers at the parameter server: g1 = g11 + g12 + g13 + g14 + g15 + g16. Aggregation is commutative and associative.
19 Traffic Reduction: Key Insight. Aggregate the gradients from 6 workers as two partial sums, g11 + g12 + g13 and g14 + g15 + g16. Aggregating gradients incrementally does not change the final result.
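The key insight can be checked directly: summing six gradients in one step gives the same vector as summing two groups of three and then adding the partial sums. The sketch uses small integer vectors as stand-in gradients so that equality is exact; with floating-point gradients, addition is only approximately associative, so results may differ in the last bits. The helper names are illustrative.

```python
from functools import reduce


def add(u, v):
    """Element-wise sum of two gradient vectors."""
    return [a + b for a, b in zip(u, v)]


def aggregate(gradients):
    """Sum a list of gradient vectors."""
    return reduce(add, gradients)


# Stand-in gradients from 6 workers (made-up integer values).
gradients = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]

direct = aggregate(gradients)           # all six at the parameter server
partial_a = aggregate(gradients[:3])    # first three workers
partial_b = aggregate(gradients[3:])    # last three workers
incremental = add(partial_a, partial_b)
```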
20 Traffic Reduction: Design. Intercept the push message from the worker to the PS.
21 Traffic Reduction: Design. Redirect the messages to a local worker for partial aggregation.
22 Traffic Reduction: Design. Send the partial results to the PS for final aggregation.
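The effect of the three design steps on traffic toward the parameter server can be sketched as follows. This is a simplified model under stated assumptions: workers are grouped by rack, one partial result leaves each rack, and a "message" is one gradient vector. The function names and the sample gradients are made up for illustration.

```python
def aggregate(vectors):
    """Element-wise sum of a list of gradient vectors."""
    return [sum(column) for column in zip(*vectors)]


def push_direct(racks):
    """Every worker sends its gradient straight to the parameter server."""
    messages = [gradient for rack in racks for gradient in rack]
    return aggregate(messages), len(messages)


def push_with_rack_aggregation(racks):
    """Each rack aggregates locally and sends one partial result to the PS."""
    partials = [aggregate(rack) for rack in racks]
    return aggregate(partials), len(partials)


# Two racks with three workers each (made-up integer gradients).
racks = [
    [[1, 0], [2, 1], [3, 2]],
    [[4, 3], [5, 4], [6, 5]],
]
direct_sum, direct_msgs = push_direct(racks)
reduced_sum, reduced_msgs = push_with_rack_aggregation(racks)
```

The final aggregate is identical, but the parameter server receives one message per rack instead of one per worker.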
23 More details in the paper: 1. Traffic reduction in pull requests. 2. Asynchronous communication.
24 Traffic Prioritization
25 Traffic Prioritization: Key Insight. These four TCP flows (Job 1 to Job 4) share a bottleneck link, and each of them gets 25% of its bandwidth.
26 Traffic Prioritization: Key Insight. Flow Completion Time (FCT): all flows are delayed! TCP per-flow fairness is harmful in distributed machine learning. [Timeline labels: Job 1 to Job 4; Model 1 to Model 4.] Average completion time is 4.
27 Traffic Prioritization: Key Insight. MLNet prioritizes the competing flows (Job 1 to Job 4) to minimize the average training time.
28 Traffic Prioritization: Key Insight. Flow Completion Time (FCT): shortening the average FCT can largely improve the average training time. [Timeline labels: Job 1 to Job 4; Model 1 to Model 4.] Average completion time is 2.
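A small model makes the comparison between fair sharing and prioritization concrete. The sketch assumes four equal-sized flows on a single bottleneck link, idealized equal sharing for the TCP case, and a shortest-first serial order for the prioritized case; the slides do not state how MLNet assigns priorities, so that ordering is an assumption. It does not try to reproduce the averages printed on the slides, which depend on a timeline diagram that did not survive transcription.

```python
def fair_share_completion_times(sizes, capacity=1.0):
    """Completion times when all active flows split the link equally."""
    times, now, served = [], 0.0, 0.0
    active = len(sizes)
    for size in sorted(sizes):
        # Each active flow still needs (size - served); the link is split
        # among `active` flows, so this phase takes that many times longer.
        now += (size - served) * active / capacity
        times.append(now)
        served = size
        active -= 1
    return times


def prioritized_completion_times(sizes, capacity=1.0):
    """Completion times when flows get the whole link one at a time."""
    times, now = [], 0.0
    for size in sorted(sizes):  # shortest first (assumed policy)
        now += size / capacity
        times.append(now)
    return times


def mean(values):
    return sum(values) / len(values)


sizes = [1.0, 1.0, 1.0, 1.0]  # four equal flows (assumption)
fair = fair_share_completion_times(sizes)
prioritized = prioritized_completion_times(sizes)
fair_avg, prioritized_avg = mean(fair), mean(prioritized)
```

In this model the last flow finishes at the same time under both policies, while the average completion time drops because earlier flows no longer wait for bandwidth they were sharing.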
29 Evaluation. Simulate a common data center network topology: classic 10 Gbps, 1024-node data center topology [Fat-Tree, SIGCOMM 08]. Train large-scale logistic regression: 65B parameters, 141 TB dataset [Parameter Server, OSDI 14]; 800 workers [Parameter Server, OSDI 14]. With production trace: data processing rate uniform(100, 200) MBps; synchronize every 30 seconds.
30 Traffic Reduction (Non-oversubscribed Net.). [Plot: training time (hours), annotated from better to worse, against number of parameter servers, annotated from cost-effective to expensive; series: Rack, Baseline.]
31 Traffic Reduction (Non-oversubscribed Net.). [Same plot.] Rack reduces completion time by 48%.
32 Traffic Reduction (Non-oversubscribed Net.). [Same plot.] Deploying more parameter servers resolves edge network bottlenecks.
33 Traffic Reduction (Non-oversubscribed Net.). [Same plot.] Deploying more parameter servers to reduce training time (1) uses more machines and (2) is only possible with non-oversubscribed networks.
34 Traffic Reduction (1:4 Oversubscribed Net.). [Plot: same axes and series.] MLNet reduces congestion in the network core and reduces training time by >70%.
35 Traffic Prioritization. 20 jobs running in the same cluster. [Plot: CDF of training time (hours); series: Baseline, Prioritization.] All jobs finish at (almost) the same time.
36 Traffic Prioritization. [Plot: CDF of training time (hours), annotated from better to worse; series: Baseline, Prioritization.] Prioritization improves the median by 25% and delays the tail by 2%.
37 Traffic Prioritization + Traffic Reduction. [Plot: CDF of training time (hours), annotated from better to worse; series: Baseline, Reduction, Priori. + Red.] Improves the median by 60% and the tail by 54%.
38 More details in the paper: 1. Binary tree aggregation. 2. More analysis.
39 Summary. MLNet can significantly improve the network performance of distributed machine learning: traffic reduction; flow prioritization; drop-in solution.
40 Thanks!
41 Discussion. Relaxed fault tolerance? When a worker fails, drop that portion of data. Adaptive communication: reduce synchronization when the network is busy? Hybrid network infrastructure? Some with 10GE, some with 40GE RoCE, etc. Degree of tree?
42 Traffic Reduction: Design. Is the local aggregator a new bottleneck? Example: 15 workers in a rack.
43 Traffic Reduction: Design. Build a balanced aggregation structure such as a binary tree. Example: 15 workers in a rack with binary tree aggregation.
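For the 15-worker example, a binary tree bounds how many gradients any single worker has to receive. The sketch below lays the workers out heap-style (worker `i` has children `2i + 1` and `2i + 2`), which is one assumed way to build a balanced tree; the slides do not specify the layout. Values are made-up integers standing in for gradients.

```python
def children(i, n):
    """Indices of worker i's children in a heap-style binary tree of n nodes."""
    return [c for c in (2 * i + 1, 2 * i + 2) if c < n]


def tree_aggregate(values):
    """Sum values up the tree; return (total at the root, largest fan-in)."""
    n = len(values)
    partial = list(values)
    max_fan_in = 0
    # Visit workers from the last index to the root, so every child has
    # already accumulated its whole subtree before its parent reads it.
    for i in reversed(range(n)):
        kids = children(i, n)
        max_fan_in = max(max_fan_in, len(kids))
        for c in kids:
            partial[i] += partial[c]
    return partial[0], max_fan_in


values = list(range(1, 16))        # 15 workers in a rack
total, tree_fan_in = tree_aggregate(values)
flat_fan_in = len(values) - 1      # one local aggregator receives from all others
```

A single local aggregator would receive 14 gradients, whereas no worker in the tree receives more than 2; the cost is that gradients travel over more hops inside the rack.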
44 Traffic Reduction. [Plot: training time (hours), annotated from better to worse, against number of parameter servers, annotated from cost-effective to expensive; series: Rack, Binary, Baseline.]
45 Traffic Reduction (Non-oversubscribed Net.). [Same plot.]
46 Traffic Reduction (Non-oversubscribed Net.). [Same plot.] Binary Tree and Rack reduce completion time by 78% and 48%, respectively.
47 Traffic Reduction (Non-oversubscribed Net.). [Same plot.] Deploying more parameter servers resolves edge network bottlenecks.
48 Traffic Reduction (Non-oversubscribed Net.). [Same plot.] Deploying more parameter servers to reduce training time (1) needs more machines and (2) is only possible with non-oversubscribed networks.
49 Traffic Reduction (1:4 Oversubscribed Net.). [Plot: same axes and series.]
50 Traffic Reduction (1:4 Oversubscribed Net.). [Same plot.] MLNet reduces congestion in the network core.
51 Traffic Reduction (1:4 Oversubscribed Net.). [Same plot.] Binary consistently consumes more bandwidth than Rack.
52 Example: Training a Neural Network. Steps: random init weight, W: {w1, w2, w3, w4}; calculate error/gradient against Truth: {cat, dog, cat, ...}, G: {g1, g2, g3, g4}; update weights, giving a new W: {w1, w2, w3, w4}.
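The three steps on this slide (random initial weights, calculate error and gradient, update weights) form the basic training loop. The sketch below runs that loop on a two-weight linear model with squared error rather than a neural network, purely to keep it short; the data, the learning rate of 0.1, and the 200 iterations are all made-up illustration values. Note that it subtracts the scaled gradient, the usual gradient-descent convention.

```python
import random


def predict(weights, features):
    return sum(w * x for w, x in zip(weights, features))


def loss(weights, samples):
    """Mean squared error over the samples."""
    return sum((predict(weights, f) - t) ** 2 for f, t in samples) / len(samples)


def gradient(weights, samples):
    """Gradient of the mean squared error with respect to each weight."""
    grad = [0.0] * len(weights)
    for features, target in samples:
        error = predict(weights, features) - target
        for j, x in enumerate(features):
            grad[j] += 2 * error * x / len(samples)
    return grad


random.seed(0)
# Made-up training data; it is fit exactly by weights [1, 0].
samples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]

weights = [random.uniform(-1, 1) for _ in range(2)]   # random init weight
initial_loss = loss(weights, samples)
for _ in range(200):
    g = gradient(weights, samples)                    # calculate error/gradient
    weights = [w - 0.1 * gj for w, gj in zip(weights, g)]  # update weights
final_loss = loss(weights, samples)
```

In the distributed setting described earlier, the gradient step is what each worker computes on its own data partition, and the update step is what the parameter server applies.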
53 Example: Neural Network Model. Train: weights W1, W2, W3, W4. Apply: Dog: 99%, Cat: 1%.
54 Model Training. [Diagram: a randomly initialized model with weights W1, W2, W3, W4 is refined repeatedly until it converges to the final model.]
More informationDELL EMC VxRAIL vsan STRETCHED CLUSTERS PLANNING GUIDE
WHITE PAPER - DELL EMC VxRAIL vsan STRETCHED CLUSTERS PLANNING GUIDE ABSTRACT This planning guide provides best practices and requirements for using stretched clusters with VxRail appliances. April 2018
More informationCS 61C: Great Ideas in Computer Architecture. MapReduce
CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing
More informationAgenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache
Databases on AWS 2017 Amazon Web Services, Inc. and its affiliates. All rights served. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services,
More informationAdvanced Computer Networks. Datacenter TCP
Advanced Computer Networks 263 3501 00 Datacenter TCP Spring Semester 2017 1 Oriana Riva, Department of Computer Science ETH Zürich Today Problems with TCP in the Data Center TCP Incast TPC timeouts Improvements
More informationInformation-Agnostic Flow Scheduling for Commodity Data Centers
Information-Agnostic Flow Scheduling for Commodity Data Centers Wei Bai, Li Chen, Kai Chen, Dongsu Han (KAIST), Chen Tian (NJU), Hao Wang Sing Group @ Hong Kong University of Science and Technology USENIX
More informationDELL EMC READY BUNDLE FOR VIRTUALIZATION WITH VMWARE AND FIBRE CHANNEL INFRASTRUCTURE
DELL EMC READY BUNDLE FOR VIRTUALIZATION WITH VMWARE AND FIBRE CHANNEL INFRASTRUCTURE Design Guide APRIL 0 The information in this publication is provided as is. Dell Inc. makes no representations or warranties
More informationPrepKing. PrepKing
PrepKing Number: 642-961 Passing Score: 800 Time Limit: 120 min File Version: 6.8 http://www.gratisexam.com/ PrepKing 642-961 Exam A QUESTION 1 Which statement best describes the data center core layer?
More informationBest Practices for Validating the Performance of Data Center Infrastructure. Henry He Ixia
Best Practices for Validating the Performance of Data Center Infrastructure Henry He Ixia Game Changers Big data - the world is getting hungrier and hungrier for data 2.5B pieces of content 500+ TB ingested
More informationBSA Sizing Guide v. 1.0
Best Practices & Architecture BSA Sizing Guide v. 1.0 For versions 8.5-8.7 Nitin Maini, Sean Berry 03 May 2016 Table of Contents Purpose & Audience 3 Scope 3 Capacity & Workload Basics 3 BSA Basics...
More informationDELL EMC READY BUNDLE FOR VIRTUALIZATION WITH VMWARE VSAN INFRASTRUCTURE
DELL EMC READY BUNDLE FOR VIRTUALIZATION WITH VMWARE VSAN INFRASTRUCTURE Design Guide APRIL 2017 1 The information in this publication is provided as is. Dell Inc. makes no representations or warranties
More information