CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

Similar documents
/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016

/ Cloud Computing. Recitation 2 January 19 & 21, 2016

CS / Cloud Computing. Recitation 11 November 5 th and Nov 8 th, 2013

/ Cloud Computing. Recitation 5 September 27 th, 2016

/ Cloud Computing. Recitation 2 September 5 & 7, 2017

/ Cloud Computing. Recitation 8 October 18, 2016

/ Cloud Computing. Recitation 10 March 22nd, 2016

/ Cloud Computing. Recitation 5 September 26 th, 2017

CS / Cloud Computing. Recitation 9 October 22 nd and 25 th, 2013

/ Cloud Computing. Recitation 5 February 14th, 2017

/ Cloud Computing. Recitation 13 April 12 th 2016

/ Cloud Computing. Recitation 13 April 14 th 2015

The MapReduce Abstraction

Hadoop/MapReduce Computing Paradigm

Database Applications (15-415)

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Accelerate Big Data Insights

Distributed Systems CS6421

Best Practices and Performance Tuning on Amazon Elastic MapReduce

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

Hadoop Map Reduce 10/17/2018 1

/ Cloud Computing. Recitation 6 October 2 nd, 2018

Introduction to Data Management CSE 344

Cloud Computing CS

Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please)

Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis

What is Cloud Computing? What are the Private and Public Clouds? What are IaaS, PaaS, and SaaS? What is the Amazon Web Services (AWS)?

vrealize Business Standard User Guide

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

ML from Large Datasets

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

Map Reduce & Hadoop Recommended Text:

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

MapReduce. Stony Brook University CSE545, Fall 2016

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

Database Systems CSE 414

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

5 Fundamental Strategies for Building a Data-centered Data Center

Introduction to MapReduce (cont.)

Principal Software Engineer Red Hat Emerging Technology June 24, 2015

2/26/2017. For instance, consider running Word Count across 20 splits

/ Cloud Computing. Recitation 7 October 10, 2017

MapReduce programming model

High Performance Computing on MapReduce Programming Framework

/ Cloud Computing. Recitation 8 March 1 st, 2016

Introduction to MapReduce

1. Stratified sampling is advantageous when sampling each stratum independently.

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Management and NoSQL Databases

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

Clustering Lecture 8: MapReduce

AWS Setup Guidelines

The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012):

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

Distributed Computation Models

Map-Reduce. Marco Mura 2010 March, 31th

2013 AWS Worldwide Public Sector Summit Washington, D.C.

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Cloud Computing & Visualization

CS 61C: Great Ideas in Computer Architecture. MapReduce

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

MapReduce Design Patterns

MapReduce and Friends

Reshaping Text Data for Efficient Processing on Amazon EC2. Gabriela Turcu, Ian Foster, Svetlozar Nestorov

Assignment Tutorial.

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

Basics of Cloud Computing Lecture 2. Cloud Providers. Satish Srirama

Next-Generation Cloud Platform

CS 345A Data Mining. MapReduce

vrealize Business for Cloud User Guide

Programming Systems for Big Data

Big Data on AWS. Peter-Mark Verwoerd Solutions Architect

Data Analysis Using MapReduce in Hadoop Environment

Distributed Filesystem

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

CS / Cloud Computing. Recitation 7 October 7 th and 9 th, 2014

Chapter 5. The MapReduce Programming Model and Implementation

Dept. Of Computer Science, Colorado State University

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Map/Reduce on the Enron dataset

Lecture 20: WSC, Datacenters. Topics: warehouse-scale computing and datacenters (Sections )

/ Cloud Computing. Recitation 7 February 24th & 26th, 2015

Today s Lecture. CS 61C: Great Ideas in Computer Architecture (Machine Structures) Map Reduce

CS427 Multicore Architecture and Parallel Computing

vrealize Business for Cloud User Guide

Basics of Cloud Computing Lecture 2. Cloud Providers. Satish Srirama

CS15-319: Cloud Computing. Lecture 3 Course Project and Amazon AWS Majd Sakr and Mohammad Hammoud

Dr. Chuck Cartledge. 4 Feb. 2015

Activity 03 AWS MapReduce

Parallel Computing: MapReduce Jin, Hai

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

Map Reduce Group Meeting

Homework 9: Stock Search Android App with Facebook Post A Mobile Phone Exercise

MySQL in the Cloud Tricks and Tradeoffs

CPET 581 Cloud Computing: Technologies and Enterprise IT Strategies

Transcription:

CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014

Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions

Last Week Reflection Unit1 : Introduction to cloud computing Quiz 1 Project 1.1 Explore the Wikimedia data set to learn the format Sequentially parse and filter the data Sort the data and save the output to a file Extract & summarize some useful data by answering questions

Update in Unit 1 learn by doing question 4 on page 137 S3 prices history : S3 Historical Price AWS monthly calculator Fixed on OLI for new viewers only Stale price! Current price is 0.003 per GB/month

Review Quiz 1 Hybrid clouds and security Utilization Software service models

Project 1 Checkpoint 1 Introduction to Big Data: Sequential Analysis Not running on AWS Big penalty Submission to s3 bucket Follow guidelines on Piazza @218 Including credentials in code Big penalty Not tagging instances Going over budget Free tier not available for consolidated accounts

This Week s Schedule Complete Module 3 & 4 in Unit 2 Read all pages in modules: Module 3: Data Center Trends Module 4: Data Center Components Complete activities on each page In-module activities are not graded but for self-test If you encounter a bug in the OLI write-up provide feedback at the end of each OLI page Complete Project 1.2 (Elastic MapReduce) Deadline, Sunday, September 14, 11:59pm EST

Why Study Data Centers? Data centers are your new computers! Make sure to read and understand the content of Unit 2 Equipment in a data center Power, cooling, networking How to design data centers What could break All software layers are on top of physical resources

Module 3: Data Center Trends Definition & Origins Infrastructure dedicated to housing computer and networking equipment, including power, cooling, and networking. Growth Size (No. of racks and cabinets) Density Efficiency Servers Server Components Power Cooling Facebook data center

Module 4: Data Center Components IT Equipment Anything that is mounted in a stack Servers : rack-mounted Motherboard Expansion cards Storage Direct attached storage (DAS) Storage area network (SAN) Network attached storage (NAS) Networking Ethernet, protocols, etc. Facilities Server room Power (distribution) Cooling Safety Source: http://www.google.com/about/datacenters/efficiency/internal/

Motivation for MapReduce The problem How many times does each term appear in all books in Hunt Library?

Motivation for MapReduce When the file size is 200MB HashMap: (term, count) Doable on a single machine When the file size is 200TB Large scale data processing Out of memory Slow How would you deal with it? Partition the input? Distribute the work? Coordinate the effort? Aggregate the results?

Motivation for MapReduce Google Example 20+ billion web pages x 20KB = 400+ TB ~1,000 hard drives to store the web 1 computer reads 30-35 MB/sec from disk ~4 months to read the web Takes even more to do something useful with the data! So standard architecture for such problems emerged Cluster of commodity Linux nodes Commodity network (Ethernet) to connect them Google s computational / data manipulation model

Introduction of MapReduce Definition: Programming model for processing large data sets with a parallel, distributed algorithm on a cluster Map: Extract something you care about Group by key: Sort and Shuffle Reduce: Aggregate, summarize, filter or transform Output the result

MapReduce in 15-319/619 In this course we are going to use MapReduce on 2 Platforms: 1. Amazon Elastic MapReduce (this week) 2. Hadoop (later projects) Learn more about Hadoop MapReduce here

MapReduce Example How many times does each term appear in all books in Hunt Library? I heard 6 Apple s!

MapReduce Example We can have two reducers Each reducer can handle one or more keys Orange,1 Blueberry,1 Blueberry,1 Orange,1 Orange,1 Blueberry,1 Apple? Orange? Blueberry?

MapReduce Example We can have three reducers Orange,1 Blueberry,1 Blueberry,1 Orange,1 Orange,1 Blueberry,1 Orange? Apple? Blueberry?

MapReduce Example Orange,1 Blueberry,1 Blueberry,1 Orange,1 Orange,1 Blueberry,1 Blueberry,1 Blueberry,1 Blueberry,1 Orange,1 Orange,1 Orange,1 Apple 6 Blueberry 3 Orange 3 Mapping Shuffling Reducing

Steps of MapReduce

What you should write in EMR Your own mapper Your own reducer

Steps of MapReduce Map Shuffle Reduce Produce final output

Steps of MapReduce Map Prepare input for mappers Split input into parts and assign them to mappers Map Tasks Each mapper will work on its portion of the data Output: key-value pairs Keys are used in Shuffling and Merge to find the Reducer that handles it Values are messages sent from mapper to reducer e.g. (Apple, 1)

Steps of MapReduce Shuffle Group by key: sort the output of mapper by key Split keys and assign them to reducers (based on hashing) Each key will be assigned to exactly one reducer Reduce Each reducer will work on one or more keys Input: mapper s output (key-value pairs) Output: the result needed Different aggregation logic may apply Produce final output Collect all output from reducers Sort them by key

Mapreduce: Environment MapReduce environment takes care of: Partitioning the input data Scheduling the program s execution across a set of machines Perform the Group by key (sort & shuffle) step In practice, this is the bottleneck Handling machine failures Manage required inter-machine communication

Parallelism in MapReduce Mappers run in parallel, creating different intermediate values from input data Reducers also run in parallel, each working on different keys However, reducers cannot start until all mappers finish

Real Example: Friend/Product Suggestion Facebook uses information on your profile, e.g. contact list, messages, direct comments made, page visits, common friends, workplace/residence nearness. This info is dumped into a log or a huge list, a big data. Then the logs are analyzed and put through a weighted matrix analysis and the connections which are above a threshold value are chosen to be shown to the user.

Real Example: Friend/Product Suggestion Generate key-value pairs: Key: Friends pair Value: Friends statistics (e.g. common friends) e.g. (Tom, Sara statistics) Aggregate the statistics value for the same key and output the friends pair if it s above the threshold e.g. (Tom Sara) Mapper HDFS (logs) Mapper Reducers Outputs Mapper

Project 1.2 Elastic MapReduce Processing sequentially can be limiting, we must: aggregate the view counts and generate a daily timeline of page views for each article we are interested in Process a large dataset (~70 GB compressed) Setup an Elastic MapReduce job Flow Write simple Mapper and Reducer in the language of your choice You will understand some of the key aspects of Elastic MapReduce and run an Elastic MapReduce job flow Note: For this checkpoint, assign the tag with Key: Project and Value: 1.2 for all resources

AWS Expenditure Monitor AWS expenses regularly For this week: EMR cost is on top of the EC2 cost of instance and EMR cost is fixed per instance type per hour for example, m2.4xlarge EMR cost is $0.42 on-top-of the spot pricing ($0.14) Check out AWS EMR pricing! Suggestions Terminate your instance when not in use stop still costs EBS money! Use smaller instances to test your code Use spot instances to save cost Use small sample dataset in EMR to decide the instance type and number of instances to use IMPORTANT!

AWS Expenditure Tagging AWS resources Setting up CloudWatch billing alert Using spot instances Please refer to slides in Recitation 2!

Penalties If No tag 10% penalty Expenditure > project budget 10% penalty Expenditure > project budget * 2 100% penalty Copy any code segment from lowest penalty is 200% Other students Previous students Internet (e.g. Stackoverflow) Do not work on code together This is about learning If you do, we will find out and take action

Submitting Code Submit your code on S3, by the deadline Submitting the wrong S3 URL on OLI will be penalized Submitting an incorrectly configured bucket will be penalized (see the instructions in the Project Guidelines page in the Project Primer on OLI) We will manually grade all code Be sure to make your code readable Preface each function with a header that describes what it does Use whitespace well. Indent when using loops or conditional statements Keep each line length to under 80 characters Use descriptive variable names For more detail, please refer to www.cs.cmu.edu/~213/codestyle.html If your code is not well documented and is not readable, we will deduct points Documentation shows us that you know what your code does! The idea is also NOT to comment every line of code

S3 Code Submission Guidelines Make your submission a single zip file (.zip) with name project<no>_andrewid_q<no>. Pack all your code, in ".java", ".py", ".rb", ".sh" or other formats in the zip file. Do NOT submit associated libraries and binary files (.jar and.class files). Please do NOT submit multiple s3 URLs in the text field on OLI. Create a single submission bucket for all of your code, using the folder hierarchy illustrated in the project primer on page 151 on OLI. Do NOT submit AWS Credential Files (aws.credentials) or any other files that contain your AWS Keys within your bucket. 10% penalty Do not make your buckets public.

Questions?

Upcoming Deadlines Project 1: Introduction to Big Data Analysis Elastic MapReduce Checkpoint Available Now Due 9/14/2014 11:59 PM EST Quiz 2: Data Centers Not Available Yet Due 9/18/2014 11:59 PM EST