Distributed Systems CS6421

Similar documents
Distributed Systems. CS422/522 Lecture17 17 November 2014

Lecture 11 Hadoop & Spark

MapReduce, Hadoop and Spark. Bompotas Agorakis

Introduction to MapReduce

Database Applications (15-415)

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Clustering Lecture 8: MapReduce

Hadoop/MapReduce Computing Paradigm

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

The MapReduce Framework

Hadoop. copyright 2011 Trainologic LTD

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Chapter 5. The MapReduce Programming Model and Implementation

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Introduction to MapReduce

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Big Data 7. Resource Management

MI-PDB, MIE-PDB: Advanced Database Systems

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data for Engineers Spring Resource Management

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

HADOOP FRAMEWORK FOR BIG DATA

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Introduction to Data Management CSE 344

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

Hadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

CCA-410. Cloudera. Cloudera Certified Administrator for Apache Hadoop (CCAH)

Mixing and matching virtual and physical HPC clusters. Paolo Anedda

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

Cloud Computing. Hwajung Lee. Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

Data Centers and Cloud Computing

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

Distributed Computation Models

A brief history on Hadoop

High Performance Computing on MapReduce Programming Framework

CS427 Multicore Architecture and Parallel Computing

Data Centers and Cloud Computing. Slides courtesy of Tim Wood

Data Analysis Using MapReduce in Hadoop Environment

Map Reduce & Hadoop Recommended Text:

Data Centers and Cloud Computing. Data Centers

CISC 7610 Lecture 2b The beginnings of NoSQL

Large-Scale Web Applications

CS 61C: Great Ideas in Computer Architecture. MapReduce

Map-Reduce. Marco Mura 2010 March, 31th

Introduction to Data Management CSE 344

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

Hadoop MapReduce Framework

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

A BigData Tour HDFS, Ceph and MapReduce

Introduction to MapReduce. Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng.

Distributed Architectures & Microservices. CS 475, Spring 2018 Concurrent & Distributed Systems

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Chapter 1: Introduction 1/29

Lecture 30: Distributed Map-Reduce using Hadoop and Spark Frameworks

Map Reduce Group Meeting

Scalable Tools - Part I Introduction to Scalable Tools

STORM AND LOW-LATENCY PROCESSING.

"Big Data... and Related Topics" John S. Erickson, Ph.D The Rensselaer IDEA Rensselaer Polytechnic Institute

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloudera Exam CCA-410 Cloudera Certified Administrator for Apache Hadoop (CCAH) Version: 7.5 [ Total Questions: 97 ]

TP1-2: Analyzing Hadoop Logs

Clustering Documents. Case Study 2: Document Retrieval

AWS Serverless Architecture Think Big

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig

Embedded Technosolutions

Cloud Computing & Visualization

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

MapReduce. U of Toronto, 2014

CompSci 516: Database Systems

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

BigData and Map Reduce VITMAC03

An Introduction to Big Data Analysis using Spark

Parallel Computing: MapReduce Jin, Hai

Cloud Computing CS

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Forget about the Clouds, Shoot for the MOON

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Introduction to Hadoop and MapReduce

Distributed Computations MapReduce. adapted from Jeff Dean s slides

MapReduce programming model

Outline. CS-562 Introduction to data analysis using Apache Spark

Dept. Of Computer Science, Colorado State University

Database Systems CSE 414

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Best Practices and Performance Tuning on Amazon Elastic MapReduce

Databases 2 (VU) ( / )

Transcription:

Distributed Systems CS6421 Intro to Distributed Systems and the Cloud Prof. Tim Wood

v I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool things Tim Wood!2

Lots of computers! Computers Web Servers 400,000,000 300,000,000 200,000,000 100,000,000 0 1993 1996 1999 2002 2005!3

Lots of problems! How to interconnect them? How to write programs for them? How to secure them? How to manage them? How to get good performance out of them? How to efficiently use resources?!4

This Course CSci 6421 Distributed Systems Prof. Tim Wood - timwood@gwu.edu Prerequisites: - CSci 6212 Algorithms (or undergrad algorithms course) - an undergraduate operating systems course - strong programming abilities, comfort with multiple languages https://gwdistsys18.github.io/!5

Be ready to code We will be writing code both in class and for your homework assignments - This course will be a lot of work! - You only learn by doing! You should be comfortable in at least one of these programming languages: - C, Python, or Java - and you should be ready and eager to learn the others! - and you must understand the basics of the Linux command line!6

Be ready to participate This class will be active and interactive - I will not be lecturing for very long each period You must participate in class - Ask questions, answer questions - Do the in-class exercises - Even if you don t speak english well, that s fine. I will be patient! You may not use your laptop except for times when I tell you to - If a slide has a blue bar at the bottom, you may not use it - If a slide has a green bar, you may use your laptop!7

Semester Outline Topic 1 Intro to Distributed Systems and the Cloud Topic 2 Network Programming (Sockets) Topic 3 Servers (EC2, VMs, Containers) Topic 4 Storage (S3, EBS) Topic 5 Networks (SDN, NFV) Topic 6 Scalable Web Services (nginx, memcache, RDS) Topic 9 Edge and Serverless Computing (AWS Lambda) Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB) (Not necessarily in this order)!8

You will learn to: Course Outcomes Use important technologies common in today s distributed systems - Hadoop, Amazon s cloud, etc. Design coordination protocols that are efficient and correct despite network delays and failures Implement and evaluate the performance of these systems!9

What is a cloud? <spoiler alert> It s not in the sky it s not made of water droplets </spoiler alert>

Some big buildings... Microsoft s Dublin data center!11

...and servers... Giant warehouses - The size of 10 football fields - 10s of thousands of servers - Petabytes of storage 5

...interconnected...!13

...around the world... Undersea Cables - Connect all continents except Antarctica - First deployed in 1850s http://www.cyprusupdates.com http://submarine-cable-map-2013.telegeography.com/!14

...that break a lot. Truck crash shuts down Amazon data center (May 2010) Lightning causes Amazon outages (2009 and 2011) Comcast down after hunter shoots cable (2008) Anchor hits underwater Internet cable (Feb 2012)!15

Or if you re really unlucky... vs!16

Why do we need all of this physical infrastructure?

What is this???!18

Encyclopedias Encyclopedia Britannica - 40,000+ articles - 32 hard bound volumes (32,640 pages) Microsoft Encarta - 60,000+ articles - 1 CD-ROM (700 MB) Wikipedia - 5,512,202 articles (in English) - More than 5 TB of text (about 7,500 CDs)!19

Mega whats? 700MB vs 5TB Mega Million 1024 x 1024 = ~1,000,000 Giga Billion 1024 x 1024 X 1024 = ~1,000,000,000 Tera Trillion 1024 x 1024 x 1024 x 1024= ~1, 000,000,000,000 200 songs vs 1.4 million songs!20

Encyclopedias Wikipedia... in print - 1,763 volumes - (no, this does not exist) Now grown to 2,696 volumes! http://en.wikipedia.org/wiki/wikipedia:size_in_volumes!21

0.01% of Wikipedia!22

Big Data in Perspective Wikipedia - 5TB of text Facebook -???!23

Big Data in Perspective Wikipedia - 5TB of text Facebook - 20TB of photos added each week!24

Big Data in Perspective Facebook - 1,000TB of photos added per year!25

Big Data in Perspective Facebook - 1,000TB of photos added per year Google - 20,000TB of data processed per day 4,000 wikipedias!26

Big Data in Perspective Facebook - 1,000TB of photos added per year Google - 20,000TB of data processed per day - in 2008 4,000 wikipedias!27

How can google process so much information so quickly?

Processing Data Quickly 1. 3 + 6 =? 2. 10-2 =? 3. 5*4-10 =? 4. -1 + 3 =? Buy a faster computer 5. 5*9 =? 6. 8*8 + 3 =? 7. 56 + 2 =? 8. 1 + 4*2 =? Buy another computer 9. 6*4 + 1 =? 10. 3 * 2 * 3 =?!29

Processing Data in PARALLEL 1. 5 + 2 =? 2. 11-2 =? 3. 14-3 =? 4. -1 + 3 =? 5. 5*9 =? 6. 8*8 + 3 =? 7. 56 + 2 =? 8. 1 + 4*2 =? 9. 6*4 + 1 =? 10. 3 * 2 * 3 =? 11. 1 + 2 + 5 =? 12. 6-2 =?!30

Let's try it at scale I have lots of questions I need answered... help me out! 18 questions per page, 40 pages... how long should it take?!31

How could we do better? What problems did we hit? How could we optimize the process? What things can't we prevent?!32

How do we solve problems at large scale?

Distributed Systems Definition? Examples????!34

Distributed Systems A distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages. Does this cover all of our examples? Is it too broad? Too narrow?!35

How to count words? How would you count the frequency of each word in a large collection of text documents? - 100M documents are stored in a distributed storage system - You have 50 worker nodes that store and process data Input: lots of text documents Desired output: is 54621 and 34213 are 30931 cat 20021!36

How should we do this at scale?!37

A Real Distributed System Let s consider Map Reduce / Hadoop Data analytic platform originally proposed by Google Hadoop is an open source version developed by Yahoo and others Goal: Make it easy to use a cluster for large scale data processing tasks - Parsing through web documents to determine content - Analyzing genetic data to identify diseases!38

It starts with data 1) HDFS provides an abstract interface to a storage cluster I want data block 3,457 Check Node 2! Name Node Data Node1 Data Node2 Data Node3 Data Node4 Data is replicated 3 times, and accesses are sent to nearby replicas!39

Hadoop programs 2) A user writes a program that takes some input data and processes it in stages: Map the input data to a set of keys and values Shuffle the data so all identical output keys are processed by the same server Reduce all output keys to a final result (more details later)!40

Job scheduling 3) User submits their program as a job to the JobTracker Name Node Job Tracker Task Tracker Data Node Task Tracker Data Node Task Tracker Task Tracker Data Node 4) The Job Tracker splits job into small tasks and distributes them across Task Trackers Prefers to place a task next to the data node with its data!41

Failure handling 5) If a task fails or is too slow, the Job Tracker restarts it on another task tracker. Job Tracker Task Tracker Task Tracker Task Tracker Task Tracker Failures and stragglers are masked!42

Hadoop Components 1 JobTracker - Centralized controller - Assigns tasks to workers, monitors for failures Lots of TaskTrackers - Worker nodes - Execute code and return results to Job Tracker 1 NameNode - Centralized storage controller - Determines where data is placed Lots of DataNodes - Storage nodes - Hold data One server may have multiple roles!43

How to count words? How would you count the frequency of each word in a large collection of text documents? - 100M documents are stored in a distributed storage system - You have 50 worker nodes that store and process data Input: lots of text documents Desired output: is 54621 and 34213 are 30931 cat 20021!44

Map Reduce Flow Input items Map Key-Value list input files (Hello, my name is ) (What is your name..) (What is my favorite ) func() hello, 1 my, 1 name, 1 is, 1 what, 1 is, 1 your, 1 Output name, 2 is, 3 what, 2 Reduce func() Key, list of values hello, [1] my, [1] name, [1,1] is, [1,1,1] what, [1,1] your, [1] Sorting and Shuffling!45

Map Phase - input: data element Map & Reduce - Convert input data into an intermediate result - All map functions can be called independently in parallel - output: {list of keys and values} Shuffle and partition - Sort all outputs by key and combine values into a list - Partition keys to create new Reduce tasks - output: Key, {list of values} Reduce Phase - input: Key, {list of values} - Combine the list of values for each key to produce an output!46

Map/Reduce Hadoop What distributed systems concepts did we see?!47

Map/Reduce Hadoop What distributed systems concepts did we see? Replication for fault tolerance and scalability Data partitioning Scheduling Automated resource management Abstraction layers!48

Common Problems What problems arise in the Hadoop design? - Either problems that it solves or fails to solve???!49

Common Problems Communication protocols Resource management Security Performance Fault tolerance Scalability Which are Systems challenges and which are Algorithmic challenges?!50

Systems and Algorithms This course is about BOTH Systems: - How to implement efficient distributed applications and the mechanisms needed for them to be scalable, secure, and reliable Algorithms: - How to design intelligent distributed applications and the policies needed for them to be scalable, secure, and reliable!51

Distributed System Architectures and Principles

The Cloud!53

Bit Torrent!54

High Performance Computing!55

Heterogeneity Openness Security Failure Handling Concurrency Quality of Service Scalability Transparency Challenges!56

Scalability What are some principles that help with scalability? What are some mechanisms that help with scalability?!57

Scalability Principles: - No single machine has all state - Make decisions based on local or partial information - No single point of failure - No global clock Techniques: - Asynchronous communication - Distribution / Partitioning - Caching - Replication Are these just for performance?!58

Transparency What is transparency? Why is it useful? How do some distributed systems provide transparency?!59

Transparency Examples Transparency is provided by abstractions that hide the complexity of distributed systems Dropbox: transparently replicates your files across several computers MapReduce / Hadoop: abstracts away the complexity of assigning processing tasks to heterogeneous servers Autonomous middleware: automatically allocates resources to applications / VMs in need!60