Map/Reduce on the Enron dataset

We are going to use EMR on the Enron dataset. This dataset contains 1,227,255 emails from Enron employees; the version we use consists of 50 GB of compressed files.

Consider the following scenario: on September 9, 2001 (really), The New York Times ran an article titled "MARKET WATCH; A Self-Inflicted Wound Aggravates Angst Over Enron". Someone (your boss?) wants to find out who frequently talked to the press in the days before. You are handed a dump of the email server.

Technically, this task consists of the following steps:

- Put the dataset into S3 (the ugly part, already done for you).
- Extract the date/sender/recipient from the email data (this is what is described in detail below).
- Filter the data to
  - only consider emails sent in the week leading up to the article, and
  - only consider messages going from Enron employees to someone not part of the organization.
- Count the number of outside interactions and only include accounts that have more than one outside contact that week.

To achieve this, you need to create a set of MapReduce jobs. We are going to implement them in Python, using the Hadoop Streaming feature that is also available in EMR. If you are new to Python, work through a short Python tutorial first. There is an example of using Python and Hadoop Streaming on EMR in the "...n-and-ngrams-on-aws/" walkthrough (not all of its details are relevant here).

In Hadoop Streaming, Python MapReduce programs are given a part of the input data on the standard system input (stdin) and are expected to write tab-separated records to the standard output (stdout). Here is a working skeleton for a map or reduce function:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip().split('\t')
    # do something with line[0], line[1], etc.
    print("some_key\tsome_payload")

The reducer counterpart starts very similarly, but there is one important difference: all values with the same key from the mapper will follow each other, which allows them to be combined.
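For illustration, suppose a word-count mapper emits the following (hypothetical) records, in input order:

the\t1
cat\t1
the\t1

After Hadoop's shuffle-and-sort phase, the reducer receives them ordered by key, so records with equal keys are adjacent and can be aggregated in a single pass:

cat\t1
the\t1
the\t1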

First, let's start a (small) cluster. Log into AWS. This time, the simple configuration is fine, and most of the defaults can stay the way they are. Some version numbers might be higher/newer than in the screenshot(s) below, but that should be fine. For the hardware configuration, we are going to start with a 2-node cluster of m1.large instances. Wait until your cluster has started ("Waiting"). While you are waiting, we need to create an S3 bucket for the output of our analysis.

3 Click "Create Bucket" Select the "US Standard" region and a name for your bucket. This name needs to be globally unique for S3. Then click "Create". Back to EMR: First, we are going to run the data transformation on a small part of the dataset. On your (now hopefully ready soon) cluster page, select "Steps", then "Add Step".

Select Step Type "Streaming Program" and give it a name. Further, set:

- Mapper to s3://enron-scripts/enron-etl.py
- Reducer to cat
- Input S3 location to s3://enron-scripts/enron-urls-small.txt
- Output S3 location to s3://enron-results/t1 (replace enron-results with the S3 bucket name you just created)

Then click "Add".
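Setting the reducer to cat makes the reduce phase a pass-through: records are sorted by key and written out unchanged. For illustration only, a Python equivalent of this identity reducer would be:

#!/usr/bin/env python
# identity reducer: copy every input record to the output unchanged,
# which is exactly what `cat` does
import sys

for line in sys.stdin:
    sys.stdout.write(line)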

You will see the MapReduce job starting, going from "Pending" to "Running" and then, hopefully, to "Completed". If something goes wrong, inspect the log files! If all went well, it is time to inspect the results in S3. Right-click the part file and download it to inspect its contents. You will find three columns, separated by the tab (\t) character, containing (1) a timestamp, (2) a sender address, and (3) a recipient address, respectively. In other words, the enron-etl.py you just ran extracted from the raw data exactly the information required for the analysis described above, i.e., for your task:

T02:58:00Z  bsitz@mail.utexas.edu  steven.p.south@enron.com
T02:58:00Z  bsitz@mail.utexas.edu  steven.p.south@enron.com
T02:58:00Z  bsitz@mail.utexas.edu  steven.p.south@enron.com
T16:11:00Z  enerfax1@bellsouth.net  enerfaxweb@yahoogroups.com
T16:11:00Z  enerfax1@bellsouth.net  enerfaxweb@yahoogroups.com
T03:38:00Z  linda.w.humphrey@williams.com  kbbaker@ppco.com
T03:38:00Z  linda.w.humphrey@williams.com  kevin.coyle@cmenergy.com
T03:38:00Z  linda.w.humphrey@williams.com  kims@kimballenergy.com
T03:38:00Z  linda.w.humphrey@williams.com  kmorrow@bcgas.com
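As a starting point for your own mapper, here is a minimal sketch (not the official solution) that filters such rows; the date window and the address-matching rules are assumptions you will need to adapt to the actual task:

#!/usr/bin/env python
# sketch: keep only Enron -> outside emails within a date window,
# emitting the sender as key and the recipient as value
import sys

# hypothetical window; substitute the dates relevant to the article
START, END = "2001-09-01", "2001-09-09"

for line in sys.stdin:
    parts = line.strip().split('\t')
    if len(parts) != 3:
        continue
    timestamp, sender, recipient = parts
    # ISO timestamps compare correctly as plain strings
    if not (START <= timestamp[:10] <= END):
        continue
    # only messages from an Enron account to a non-Enron account
    if sender.endswith('@enron.com') and not recipient.endswith('@enron.com'):
        print(sender + '\t' + recipient)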

In fact, AWS/EMR/Hadoop might choose to use more than one reducer (check the "View jobs" and "View tasks" pages as described below for details), and then the result will be distributed over more than one file. In my latest test, AWS/EMR/Hadoop used two reducers for my job, resulting in two part files. This file (or these files) will be the input for your next MapReduce job as described above. (Tip: if you specify as Input S3 location not a single file (such as s3://enron-scripts/enron-urls-small.txt in the example above) but a directory (folder), e.g., the s3://enron-results/t1 result folder you used above, AWS/EMR/Hadoop will automatically iterate over all files in that directory, i.e., you do not need to concatenate them yourself in any way.)

Create a mapper.py and a reducer.py script, upload them to your S3 bucket, point to them when creating a new "Streaming Program" step, and run them. See the skeleton further up for an example. The mapper is expected to output a key and values separated by a tab (\t) character. As mentioned in the slides, the mapper typically filters records and outputs them with a common key, and the reducers read the records grouped by key and output an aggregation. Here are examples of Hadoop Streaming mappers and reducers doing Wordcount (text files are available at /Wordcount-Mapper.py and its reducer counterpart):

Mapper:

#!/usr/bin/env python
import sys

# line-wise iterate over standard input (stdin)
for line in sys.stdin:
    # split line (after stripping off any leading/trailing whitespace)
    # on whitespace into "words"
    words = line.strip().split()
    # iterate over all words of the line
    for word in words:
        # print the word (after stripping off any leading/trailing
        # whitespace) as key and the number "1" as value,
        # as a tab-('\t')-separated (key, value) pair
        print(word.strip() + "\t1")

Reducer:

#!/usr/bin/env python
import sys

# initialize variables
current_count = 0
current_word = ""

# line-wise iterate over standard input (stdin)
# (recall, each line is expected to consist of a tab-('\t')-separated
# (key, value) pair)
for line in sys.stdin:
    # split line (after stripping off any leading/trailing whitespace)
    # on tab ('\t') into key & value
    line = line.strip().split('\t')
    # sanity check: did we indeed get exactly two parts (key & value)?
    # if not, skip this line and continue with the next one
    if len(line) != 2:
        continue
    # extract key
    key = line[0]
    # new (next) key
    # (recall, keys are expected to arrive in sorted order)
    if key != current_word:
        if current_count > 0:
            # print previous key and aggregated count
            # as tab-('\t')-separated (key, value) pair
            print(current_word + '\t' + str(current_count))
        # reset counter to 0 and remember the new key
        current_count = 0
        current_word = key
    # increment count by 1
    current_count += 1

if current_count > 0:
    # print last key and its aggregated count
    # as tab-('\t')-separated (key, value) pair
    print(current_word + '\t' + str(current_count))

If anything goes wrong (which is likely in the beginning), you should inspect the log files provided for the EMR step. It can take a few minutes for them to appear in the web interface. Also check the logs of failing tasks! Finally, make sure each job's output directory does not exist yet in S3; otherwise the job will fail.

For local (i.e., on your laptop) prototyping of your Map() and Reduce() scripts, follow the instructions on the course website, replacing kjv.txt with the part file created and downloaded above.
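The same pattern carries over to the actual task. As a sketch (again an adaptation, not the official solution): instead of summing ones, the reducer can collect each sender's distinct recipients and keep only accounts with more than one outside contact:

#!/usr/bin/env python
# sketch: input is the (sender, recipient) pairs emitted by the filtering
# mapper, grouped by sender; count each sender's distinct outside contacts
# and keep only those with more than one
import sys

current_sender = None
contacts = set()

def flush(sender, contacts):
    # emit a sender only if it has more than one distinct outside contact
    if sender is not None and len(contacts) > 1:
        print(sender + '\t' + str(len(contacts)))

for line in sys.stdin:
    parts = line.strip().split('\t')
    if len(parts) != 2:
        continue
    sender, recipient = parts
    if sender != current_sender:
        flush(current_sender, contacts)
        current_sender = sender
        contacts = set()
    contacts.add(recipient)

flush(current_sender, contacts)

Both scripts can be tried locally before submitting a step, e.g. with cat part-file | ./mapper.py | sort | ./reducer.py, where sort stands in for Hadoop's shuffle-and-sort phase.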

Larger Dataset

To run the ETL (and your subsequent job) on the larger dataset, create a step as follows: select Step Type "Custom JAR" and give it a name. Set JAR location to command-runner.jar and set Arguments to:

hadoop-streaming -Dmapred.map.tasks=100 -files s3://enron-scripts/enron-etl.py -mapper enron-etl.py -reducer cat -input s3://enron-scripts/enron-urls.txt -output s3://enron-results/f1

(replace enron-results with the S3 bucket name you just created). Then click "Add".

Note: this is a normal Hadoop Streaming job, too, but for complicated reasons we need to set a custom MapReduce parameter. It is essentially the same as the first ETL (extract, transform, load) job above, but now for the large/entire dataset rather than only a small subset. Thus, it generates the same three-column, tab-(\t)-separated result, containing (1) a timestamp, (2) a sender address, and (3) a recipient address, respectively, but now in s3://.../f1/ rather than s3://.../t1/. Hence, for the assignment, you need to run your own Map() & Reduce() jobs on the result of this large ETL job, just as you ran the word-count example above on the result of the small ETL job, but now using s3://.../f1/ as input rather than s3://.../t1/.
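Your own job can be submitted the same way. For illustration, assuming your scripts live in the bucket named enron-results (substitute your own bucket, script, and output names), the Arguments line might look like:

hadoop-streaming -files s3://enron-results/mapper.py,s3://enron-results/reducer.py -mapper mapper.py -reducer reducer.py -input s3://enron-results/f1 -output s3://enron-results/f2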

After the step has started, inspect its mapper tasks: scroll down and you will see a large number of them. In its current state, your cluster will take a long time to finish all those tasks. But since this is the cloud, we can simply request more firepower: on your cluster details page, select "Resize".

Increase the "Core Instance Group" to a count of 5. Once the additional nodes are available, the step will be processed much faster. After it has completed, run your own MapReduce job on the larger results.

Once finished, again make sure to shut down your EMR cluster!

ETL script, for reference (also available as a plain text file):

#!/usr/bin/env python
# this turns the enron archive into tuples (date, from, to)
# note: written for Python 2 (urllib.urlretrieve, print statement)
import sys
import zipfile
import tempfile
import email
import email.utils
import time
import datetime
import os
import urllib

# stdin is the list of URLs to the data files
for u in sys.stdin:
    u = u.strip()
    if not u:
        continue
    tmpf = tempfile.mkstemp()
    urllib.urlretrieve(u, tmpf[1])
    try:
        zip = zipfile.ZipFile(tmpf[1], 'r')
    except:
        continue
    txtf = [i for i in zip.infolist() if i.filename.endswith('.txt')]
    for f in txtf:
        msg = email.message_from_file(zip.open(f))
        tostr = msg.get("to")
        fromstr = msg.get("from")
        datestr = msg.get("date")
        # skip messages with missing headers
        if tostr is None or fromstr is None or datestr is None:
            continue
        toaddrs = [email.utils.parseaddr(a) for a in tostr.split(',')]
        fromaddr = email.utils.parseaddr(fromstr)[1].replace('\'', '').strip().lower()
        try:
            # datetime hell, convert custom time zone stuff to UTC
            dt = datetime.datetime.strptime(datestr[:25].strip(), '%a, %d %b %Y %H:%M:%S')
            dt = dt + datetime.timedelta(hours=int(datestr[25:].strip()[:3]))
        except ValueError:
            continue
        # skip malformed sender addresses
        if not '@' in fromaddr or '/' in fromaddr:
            continue
        for a in toaddrs:
            # skip malformed recipient addresses
            if not '@' in a[1] or '/' in a[1]:
                continue
            ta = a[1].replace('\'', '').strip().lower()
            print dt.isoformat() + 'Z\t' + fromaddr + '\t' + ta
    zip.close()
    os.remove(tmpf[1])
