Today s Agenda. Today s Agenda 9/8/17. Networking and Messaging

Similar documents
CS 326: Operating Systems. Networking. Lecture 17

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

Announcements. No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6

No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6

CS519: Computer Networks. Lecture 5, Part 1: Mar 3, 2004 Transport: UDP/TCP demux and flow control / sequencing

ETSF10 Internet Protocols Transport Layer Protocols

Network Management & Monitoring

NET ID. CS519, Prelim (March 17, 2004) NAME: You have 50 minutes to complete the test. 1/17

Reliable File Transfer

PLEASE READ CAREFULLY BEFORE YOU START

Async Programming & Networking. CS 475, Spring 2018 Concurrent & Distributed Systems

TCP Strategies. Keepalive Timer. implementations do not have it as it is occasionally regarded as controversial. between source and destination

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit)

Inter-Process Communication

14-740: Fundamentals of Computer and Telecommunication Networks

Topics. TCP sliding window protocol TCP PUSH flag TCP slow start Bulk data throughput

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

The evasive speed of your Internet

Distributed Filesystem

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Yuval Carmel Tel-Aviv University "Advanced Topics in Storage Systems" - Spring 2013

CS 43: Computer Networks. 19: TCP Flow and Congestion Control October 31, Nov 2, 2018

Lecture 21: Congestion Control" CSE 123: Computer Networks Alex C. Snoeren

Extending the LAN. Context. Info 341 Networking and Distributed Applications. Building up the network. How to hook things together. Media NIC 10/18/10

Computer Networks. ENGG st Semester, 2010 Hayden Kwok-Hay So

Lecture 11 Hadoop & Spark

MASV Accelerator Technology Overview

CS 3640: Introduction to Networks and Their Applications

Network Design Considerations for Grid Computing

Basics of datacommunication

CS519: Computer Networks

CSC 4900 Computer Networks: Introduction

Lecture 2: Internet Structure

Introduction to Networking and Systems Measurements

Computer Communication Networks Midterm Review

Computer Networks and Mobile Systems. Shyam Gollakota

Big Data XML Parsing in Pentaho Data Integration (PDI)

Basic Concepts. CS168, Fall 2014 Sylvia Ratnasamy h<p://inst.eecs.berkeley.edu/~cs168/fa14/

xkcd.com End To End Protocols End to End Protocols This section is about Process to Process communications.

Why Your Application only Uses 10Mbps Even the Link is 1Gbps?

Introduction to computer networking

Homework 2 COP The total number of paths required to reach the global state is 20 edges.

6.033 Lecture 12 3/16/09. Last time: network layer -- how to deliver a packet across a network of multiple links

Ten most important ideas/concepts to take away from Chapter 1

Multi-class Applications for Parallel Usage of a Guaranteed Rate and a Scavenger Service

Overview. Performance metrics - Section 1.5 Direct link networks Hardware building blocks - Section 2.1 Encoding - Section 2.2 Framing - Section 2.

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

Unit 5: Computer Networking CS 101, Fall 2018

Question Score 1 / 19 2 / 19 3 / 16 4 / 29 5 / 17 Total / 100

Homework 1. Question 1 - Layering. CSCI 1680 Computer Networks Fonseca

Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs.

Basic Reliable Transport Protocols

CSE 123A Computer Networks

A Talari Networks White Paper. Turbo Charging WAN Optimization with WAN Virtualization. A Talari White Paper

Page 1. Goals for Today" Discussion" Example: Reliable File Transfer" CS162 Operating Systems and Systems Programming Lecture 11

The Google File System

Department of EECS - University of California at Berkeley EECS122 - Introduction to Communication Networks - Spring 2005 Final: 5/20/2005

Programming Systems for Big Data

Utilizing Datacenter Networks: Centralized or Distributed Solutions?

CS 428/528 Computer Networks Lecture 01. Yan Wang

Network Software Implementations

Acceleration Systems Technical Overview. September 2014, v1.4

Sena Technologies White Paper: Latency/Throughput Test. Device Servers/Bluetooth-Serial Adapters

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2015

Direct Link Communication I: Basic Techniques. Data Transmission. ignore carrier frequency, coding etc.

Congestion Control End Hosts. CSE 561 Lecture 7, Spring David Wetherall. How fast should the sender transmit data?

Announcements. TAs office hours: Mohamed Grissa: Mohamed Alkalbani:

QoS on Low Bandwidth High Delay Links. Prakash Shende Planning & Engg. Team Data Network Reliance Infocomm

Internetwork. recursive definition point-to-point and multi-access: internetwork. composition of one or more internetworks

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2015

CS 345A Data Mining. MapReduce

STATS Data Analysis using Python. Lecture 8: Hadoop and the mrjob package Some slides adapted from C. Budak

COMS Introduction to Computers. Networking

Advanced Network Design

Files and Streams

Map Reduce. Yerevan.

Congestion Control In The Internet Part 2: How it is implemented in TCP. JY Le Boudec 2014

Assignment #1. Csci4211 Spring Due on Feb. 13th, Notes: There are five questions in this assignment. Each question has 10 points.

CS 640 Lecture 4: 09/11/2014

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

CS644 Advanced Networks

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

TCP and BBR. Geoff Huston APNIC

Lecture 11: Packet forwarding

The Internet. Session 3 INST 301 Introduction to Information Science

Packet: Data can be broken into distinct pieces or packets and then reassembled after delivery. Computers on the Internet communicate via packets.

CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007

Ultra high-speed transmission technology for wide area data movement

CS427 Multicore Architecture and Parallel Computing

In this lecture we cover a number of networking issues pertinent to the support of distributed computing. Much of the material is covered in more

cs144 Midterm Review Fall 2010

CSE 373: Data Structures and Algorithms. Memory and Locality. Autumn Shrirang (Shri) Mare

Observing Bufferbloat using mininet

Week 5, continued. This is CS50. Harvard University. Fall Cheng Gong

CS519: Computer Networks. Lecture 5, Part 5: Mar 31, 2004 Queuing and QoS

Network Layer (1) Networked Systems 3 Lecture 8

Reliable Transport I: Concepts and TCP Protocol

Send documentation comments to You must enable FCIP before attempting to configure it on the switch.

ECE 435 Network Engineering Lecture 10

Memory Management. Kevin Webb Swarthmore College February 27, 2018

Transcription:

CS 686: Special Topics in Big Data Networking and Messaging Lecture 7 Today s Agenda Project 1 Updates Networking topics in Big Data Message formats and serialization techniques CS 686: Big Data 2 Today s Agenda Project 1 Updates Networking topics in Big Data Message formats and serialization techniques CS 686: Big Data 3 1

Project 1 Updates There have been a few minor tweaks to the P1 spec Most important: Due Oct 6 Deliverables are posted on Canvas You can submit your design documents via Canvas Protocol buffers and Apache Maven are now installed on the bass cluster Store your chunk data in /home2/<username> CS 686: Big Data 4 A few things to think about How you ll configure your chunk sizes File placement One approach: random.nextint() Replication strategy HDFS has its rack abstraction; we can come up with something simpler CS 686: Big Data 5 Today s Agenda Project 1 Updates Networking topics in Big Data Message formats and serialization techniques CS 686: Big Data 6 2

Network Concerns For the most part, we can rely on the network to do its job and live at a higher level of abstraction Many networking concerns still creep up in distributed systems For instance, do we use TCP or UDP? We still need to think about: Bandwidth Latency CS 686: Big Data 7 Bandwidth Also known as throughput How many bits we can push through the network Trending upward over time slowly 1000 Mbps networks are common in data centers 10 Gbps is gaining some traction Many home internet services are still in the range of 25-100 Mbps If we re going to use plumbing metaphors, it s the size of the pipe CS 686: Big Data 8 Latency How long it takes your bits to get from one point to another Latency has been trending downward overall, but not by leaps and bounds We are limited by the laws of physics here How long the pipe is and how fast data can travel through it Some communication mediums are more prone to latency: Ethernet vs WiFi CS 686: Big Data 9 3

Round Trip Times Another factor to consider is the round-trip time RTTs Another way of looking at latency Communication rarely goes one way Even if you re uploading a file, you d like to confirm that it actually made it over in one piece Testing RTTs: pinging a host CS 686: Big Data 10 Ping Round Trip Times PING alpha.lan (10.0.0.1): 56 data bytes 64 bytes from 10.0.0.1: icmp_seq=0 ttl=64 time=1.133 ms (From my couch to the closet) PING stargate.cs.usfca.edu (138.202.168.21): 56 data bytes 64 bytes from 138.202.168.21: icmp_seq=0 ttl=53 time=15.640 ms (Down the road to USF) PING ruby.cs.colostate.edu (129.82.45.204): 56 data bytes 64 bytes from 129.82.45.204: icmp_seq=0 ttl=50 time=70.991 ms (Fort Collins, Colorado) PING speedtest.shg1.linode.com (139.162.65.37): 56 data bytes 64 bytes from 139.162.65.37: icmp_seq=0 ttl=50 time=124.017 ms (Singapore) CS 686: Big Data 11 Interesting: Physical Distances From my router to USF: 2 miles / 3.2 km From USF to Colorado: ~950 miles / 1609 km From USF to Singapore: ~8,434 miles / 13,573 km https://www.submarinecablemap.com CS 686: Big Data 12 4

Source: https://www.submarinecablemap.com CS 686: Big Data 13 Source: https://www.submarinecablemap.com CS 686: Big Data 14 Bandwidth-Delay Product (1/2) Link capacity (bps) * Round-trip delay time (s) This measures how many bits are in flight How much data is sent before the receiver gets anything A network with a large bandwidth-delay product is called a long fat network, or LFN (elephen) Satellite networking has a large bandwidth-delay product CS 686: Big Data 15 5

Bandwidth-Delay Product (2/2) Important because of its interplay with TCP TCP dynamically tunes its window size If the window is too small, link capacity is wasted If the window is too large, the other end gets overwhelmed! Queuing delays Thinking back to our ping example: My local network (300 Mbit/sec, 1.1 ms) = 0.14 MB Singapore (50 Mbit/sec, 124.0 ms) = 0.78 MB CS 686: Big Data 16 To wrap up Most big data applications operate on top of enough abstraction to almost forget the network exists This is fine until we hit a situation where one of our nodes experiences packet loss or congestion Distributed computations often wait on all participating nodes replies You re only as fast as your slowest worker In the cloud, this can become a major issue Where are your nodes? What is the network like? CS 686: Big Data 17 Today s Agenda Project 1 Updates Networking topics in Big Data Message formats and serialization techniques CS 686: Big Data 18 6

Messaging Previously, we talked a bit about network design The messages you send between components and the network design you choose are closely related For instance, recall our ring overlay: we can get by with just a single message type CS 686: Big Data 19 Simple Messaging (1/3) If you want to send one well-defined message between components, all you need is a fixed-size buffer: byte[] buffer = new byte[25]; Remember: sockets are byte streams, not message streams A much more common format: CS 686: Big Data 20 Simple Messaging (2/3) Once you ve unpacked the message payload, it can contain more fields: This allows for a layered approach: Network code Object creation code Pass through a chain of handlers CS 686: Big Data 21 7

Simple Messaging (3/3) If you don t need advanced features, size prefixed messages work well Exceptions: You d like to avoid reading the entire message before you start processing it You don t even need to process the whole message (perhaps you are forwarding it somewhere else) Distributed systems wire formats have a huge range of features and complexity CS 686: Big Data 22 Serialization Serialization transforms an object, structure, or application state into a format for transmission (and often storage to disk) Most common: binary formats Better performance When you receive a serialized message, transforming it back into its original representation is called deserialization CS 686: Big Data 23 Java Serialization (1/2) Another option is Java s built-in serialization My advice: don t use it for anything but prototyping Python has similar functionality in the pickling module These types of serialization are language-specific, brittle, and can lead to application errors Memory leaks Broken messages between versions CS 686: Big Data 24 8

Java Serialization (2/2) Automated serialization is often not very performant May produce large object graphs Class members may be serialized that you don t need What s the big deal though? Aren t there more important things to worry about? Distributed systems, whether they re storing data or processing it, have to communicate frequently In some applications you ll speed ~50-70% of your CPU time serializing / deserializing messages CS 686: Big Data 25 Alternative Approaches (1/2) One is Protocol Buffers, of course Another is manual serialization In Java: DataOutputStream Write an int for the message size, followed by an int for the message type, then a string, etc This can be extremely error prone Apache Hive SerDes require manual work but automate some of the processes CS 686: Big Data 26 Alternative Approaches (2/2) Apache Thrift is quite similar to Protocol Buffers Designed by Facebook Lets you define your messages using a DSL, generate code for different languages, etc. Also includes an remote procedure call framework XML L JSON Not highly performant but user friendly, widely supported, and easy to debug CS 686: Big Data 27 9

Benchmarks The JVM Serializers project by Eishay Smith makes it easy to compare serialization technologies See: https://github.com/eishay/jvm-serializers Results shown here were collected on my laptop (more or less similar specs to a commodity server at least back when it was new!) macos, Quad Core I7-4770HQ, 16 GB RAM CS 686: Big Data 28 Automatic Serialization + Deserialization (ns) CS 686: Big Data 29 Size (bytes) CS 686: Big Data 30 10

Manual Optimization (ns) CS 686: Big Data 31 11