Map/Reduce on the Enron dataset

We are going to use EMR on the Enron email dataset:
http://aws.amazon.com/datasets/enron-email-data/
https://en.wikipedia.org/wiki/enron_scandal

This dataset contains 1,227,255 emails from Enron employees. The version we use consists of 50 GB of compressed files.

Consider the following scenario: On September 9, 2001 (really), The New York Times ran an article titled "MARKET WATCH; A Self-Inflicted Wound Aggravates Angst Over Enron" (http://www.webcitation.org/5tz2mrm4u). Someone (your boss?) wants to find out who frequently talked to the press in the days before. You are handed a dump of the email server.

Technically, this task consists of the following steps:
- Put the dataset into S3 (the ugly part, already done for you)
- Extract the date/sender/recipient from the email data (this is what is described in detail below)
- Filter the data to
  - only consider emails between 2001-09-05 and 2001-09-08, and
  - only consider messages going from Enron employees to someone not part of the organization
- Count the number of foreign interactions and only include accounts that have more than one outside contact that week.

To achieve this, you need to create a set of MapReduce jobs. We are going to implement them in Python, using the Hadoop Streaming feature, which is also available in EMR:
http://hadoop.apache.org/docs/r1.2.1/streaming.html

If you are new to Python, check out
http://www.afterhoursprogramming.com/tutorial/python/introduction/

Here is an example of using Python and Hadoop Streaming on EMR (not all details are relevant here):
https://dbaumgartel.wordpress.com/2014/04/10/an-elastic-mapreduce-streaming-example-with-python-and-ngrams-on-aws/

In Hadoop Streaming, Python MapReduce programs are given a part of the input data on the standard system input (stdin) and are expected to write tab-separated tables to the standard output (stdout). Here is a working skeleton for a map or reduce function:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip().split('\t')
    # do something with line[0], line[1] etc.
    print("some_key\tsome Payload")
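
To make the task more concrete, here is one possible shape of such a filtering mapper. This is only a sketch, not the required solution: it assumes its input is already the tab-separated (timestamp, sender, recipient) output of the ETL step described below, and it simply treats addresses ending in @enron.com as Enron employees; both are assumptions you may want to refine.

#!/usr/bin/env python
# sketch of a filtering mapper; assumes ETL output lines: timestamp \t sender \t recipient
import sys

for line in sys.stdin:
    parts = line.strip().split('\t')
    if len(parts) != 3:
        continue
    timestamp, sender, recipient = parts
    # keep only the days before the article (ISO dates compare correctly as strings)
    if not ('2001-09-05' <= timestamp[:10] <= '2001-09-08'):
        continue
    # keep only mail going from an Enron address to an outside address
    # (assumption: Enron employees use @enron.com addresses)
    if sender.endswith('@enron.com') and not recipient.endswith('@enron.com'):
        # emit the sender as key and the outside recipient as value
        print(sender + '\t' + recipient)

A matching reducer could then count, per sender, how many distinct outside recipients appear, following the same pattern as the wordcount reducer shown further below.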

The reducer counterpart starts very similarly, but with one important difference: all values with the same key from the mapper arrive one after another, which allows them to be combined.

First, let's start a (small) cluster. Log into AWS at
https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1

We are going to start with a small cluster. This time, the simple configuration is fine, and most of the defaults can stay the way they are. Some version numbers might be higher/newer than in the screenshot(s) below, but that should be fine. For the hardware configuration, we are going to start with a 2-node cluster of m1.large instances.

Wait until your cluster has started ("Waiting"). While you are waiting, we need to create an S3 bucket for the output of our analysis.

Click "Create Bucket" Select the "US Standard" region and a name for your bucket. This name needs to be globally unique for S3. Then click "Create". Back to EMR: First, we are going to run the data transformation on a small part of the dataset. On your (now hopefully ready soon) cluster page, select "Steps", then "Add Step".

Select Step Type "Streaming Program" and give it a name. Further, set Mapper to s3://enron-scripts/enron-etl.py Reducer to cat Input S3 Location to s3://enron-scripts/enron-urls-small.txt And output S3 location to s3://enron-results/t1 (replace enron-results with the S3 bucket name you just created.) Then click "Add".

You will see the MapReduce job starting, going from "Pending" to "Running" and then hopefully to "Completed". If something goes wrong, inspect the log files! If all went well, it is time to inspect the results in S3. Right-click the file part-00000 and download it to inspect its contents. You will find three columns, separated by a tab (\t) character, containing (1) a timestamp, (2) a sender email address, and (3) a recipient email address, respectively. In other words, the enron-etl.py you just ran extracted from the raw data exactly the information required for the analysis described above, i.e., for your task.

2001-04-18T02:58:00Z    bsitz@mail.utexas.edu    steven.p.south@enron.com
2001-04-18T02:58:00Z    bsitz@mail.utexas.edu    steven.p.south@enron.com
2001-04-18T02:58:00Z    bsitz@mail.utexas.edu    steven.p.south@enron.com
2001-04-18T16:11:00Z    enerfax1@bellsouth.net    enerfaxweb@yahoogroups.com
2001-04-18T16:11:00Z    enerfax1@bellsouth.net    enerfaxweb@yahoogroups.com
2001-04-19T03:38:00Z    linda.w.humphrey@williams.com    kbbaker@ppco.com
2001-04-19T03:38:00Z    linda.w.humphrey@williams.com    kevin.coyle@cmenergy.com
2001-04-19T03:38:00Z    linda.w.humphrey@williams.com    kims@kimballenergy.com
2001-04-19T03:38:00Z    linda.w.humphrey@williams.com    kmorrow@bcgas.com
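
If your own mapper needs to test these timestamps against the 2001-09-05 to 2001-09-08 window, note that the first column is in ISO 8601 format, so comparing the date prefix as a string already does the right thing (as in the mapper sketch further up); parsing is only needed if you want an actual datetime object. A small self-contained illustration using the first sample line above:

from datetime import datetime

line = "2001-04-18T02:58:00Z\tbsitz@mail.utexas.edu\tsteven.p.south@enron.com"
timestamp, sender, recipient = line.strip().split('\t')

# ISO 8601 timestamps sort chronologically, so the window test is a plain string test
in_window = '2001-09-05' <= timestamp[:10] <= '2001-09-08'

# optionally parse into a real datetime object instead
dt = datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%SZ')
print(in_window, dt.date())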

In fact, AWS/EMR/Hadoop might choose to use more than one reducer (check "View jobs" and "View tasks" as described below for details), and then the result will be distributed over more than one file. In my latest test, AWS/EMR/Hadoop used two reducers for my job, resulting in two files, i.e., part-00000 and part-00001. This file (or these files) will be the input for your next MapReduce job as described above.

(Tip: If you specify as Input S3 Location not a file (such as s3://enron-scripts/enron-urls-small.txt in the above example) but a directory (folder), e.g., the s3://enron-results/t1 result folder you used above, AWS/EMR/Hadoop will automatically iterate over all files in that directory, i.e., you do not need to concatenate them yourself in any way.)

Create a mapper.py and a reducer.py script, upload them to your S3 bucket, point to them when creating the "Streaming Program" step, and run them. See the skeleton further up for an example. The Mapper is expected to output a key and values separated by a tab (\t) character. As mentioned in the slides, the Mapper typically filters records and outputs them with a common key, and the Reducers read all records sharing a key and output an aggregation.

Here are examples of Hadoop Streaming Mappers and Reducers doing Wordcount (the plain-text files are available at http://homepages.cwi.nl/~manegold/uva-abs-mba-bdba-bdit-2017/Wordcount-Mapper.py and http://homepages.cwi.nl/~manegold/uva-abs-mba-bdba-bdit-2017/wordcount-reducer.py):

Mapper

#!/usr/bin/env python
import sys

# line-wise iterate over standard input (stdin)
for line in sys.stdin:
    # split line (after stripping off any leading/trailing whitespace)
    # on whitespace into "words"
    words = line.strip().split()
    # iterate over all words of a line
    for word in words:
        # print word (after stripping off any leading/trailing whitespace)
        # as key and the number "1" as value,
        # as a tab-('\t')-separated (key,value) pair
        print(word.strip() + "\t1")

Reducer

#!/usr/bin/env python
import sys

# initialize variables
current_count = 0
current_word = ""

# line-wise iterate over standard input (stdin)
# (recall, each line is expected to consist of a tab-('\t')-separated
# (key,value) pair)
for line in sys.stdin:
    # split line (after stripping off any leading/trailing whitespace)
    # on tab ('\t') into key & value
    line = line.strip().split('\t')
    # sanity check: did we indeed get exactly two parts (key & value)?
    # if not, skip line and continue with the next line
    if len(line) != 2:
        continue
    # extract key
    key = line[0]
    # new (next) key
    # (recall, keys are expected to arrive in sorted order)
    if (key != current_word):
        if (current_count > 0):
            # print previous key and aggregated count
            # as tab-('\t')-separated (key,value) pair
            print(current_word + '\t' + str(current_count))
        # reset counter to 0 and remember the new key
        current_count = 0
        current_word = key
    # increment count by 1
    current_count += 1

if (current_count > 0):
    # print last key and its aggregated count
    # as tab-('\t')-separated (key,value) pair
    print(current_word + '\t' + str(current_count))

If anything goes wrong (which is likely in the beginning), you should inspect the log files provided for the EMR Step. It can take a few minutes for them to appear in the web interface. Also check the logs of failing tasks! Finally, make sure each job's output directory does not yet exist in S3, otherwise the job will fail.

For local (i.e., on your laptop) prototyping of your Map() and Reduce() scripts, follow the instructions on the course website, replacing kjv.txt with the part-00000 / part-00001 file(s) created and downloaded above.

Larger Dataset

To run the ETL (and your subsequent job) on the larger dataset, create a step as follows:

Select Step Type "Custom JAR" and give it a name.

Set JAR location to command-runner.jar
Set Arguments to:

hadoop-streaming -Dmapred.map.tasks=100 -files s3://enron-scripts/enron-etl.py -mapper enron-etl.py -reducer cat -input s3://enron-scripts/enron-urls.txt -output s3://enron-results/f1

(replace enron-results with the S3 bucket name you just created).

Note: This is a normal Hadoop Streaming job, too, but for complicated reasons we need to set a custom MapReduce parameter.

NOTE: This is essentially the same ETL (extract, transform, load) job as above, but now for the large/entire dataset rather than only a small subset. Thus, it generates the same three-column, tab-(\t)-separated result containing (1) a timestamp, (2) a sender email address, and (3) a recipient email address, respectively, but now in s3://.../f1/ rather than s3://.../t1/. Hence, for the assignment, you need to run your own Map() & Reduce() jobs on the result of this large ETL job, just as you ran the wordcount example above on the result of the small ETL job, but now using s3://.../f1/ as input rather than s3://.../t1/.

Then click "Add". After the Step has started, inspect its Mapper tasks:

Scroll down to inspect the large number of Mapper tasks. In its current state, your cluster will take a long time to finish all of them. But since this is the cloud, we can simply request more firepower: on your cluster details page, select "Resize".

Increase the "Core Instance Group" to a count of 5 like so: Once the additional nodes are available, the Step will process much faster. After it has been completed, run your MapReduce job on the larger results.

Once finished, again make sure to shut down your EMR cluster!

ETL script for reference (a plain-text file is available at http://homepages.cwi.nl/~manegold/uva-abs-mba-bdba-bdit-2017/enron-etl.py):

#!/usr/bin/env python
# this turns enron email archive into tuples (date, from, to)
import sys
import zipfile
import tempfile
import email
import email.utils
import time
import datetime
import os
import urllib

# stdin is list of URLs to data files
for u in sys.stdin:
    u = u.strip()
    if not u:
        continue
    tmpf = tempfile.mkstemp()
    urllib.urlretrieve(u, tmpf[1])
    try:
        zip = zipfile.ZipFile(tmpf[1], 'r')
    except:
        continue
    txtf = [i for i in zip.infolist() if i.filename.endswith('.txt')]
    for f in txtf:
        msg = email.message_from_file(zip.open(f))
        tostr = msg.get("to")
        fromstr = msg.get("from")
        datestr = msg.get("date")
        # skip messages that lack any of the three headers we need
        if (tostr is None or fromstr is None or datestr is None):
            continue
        toaddrs = [email.utils.parseaddr(a) for a in tostr.split(',')]
        fromaddr = email.utils.parseaddr(fromstr)[1].replace('\'', '').strip().lower()
        try:
            # datetime hell, convert custom time zone stuff to UTC
            dt = datetime.datetime.strptime(datestr[:25].strip(), '%a, %d %b %Y %H:%M:%S')
            dt = dt + datetime.timedelta(hours=int(datestr[25:].strip()[:3]))
        except ValueError:
            continue
        # skip malformed sender addresses
        if not '@' in fromaddr or '/' in fromaddr:
            continue
        for a in toaddrs:
            # skip malformed recipient addresses
            if (not '@' in a[1] or '/' in a[1]):
                continue
            ta = a[1].replace('\'', '').strip().lower()
            print dt.isoformat() + 'Z\t' + fromaddr + '\t' + ta
    zip.close()
    os.remove(tmpf[1])
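
The script relies on email.utils.parseaddr to turn raw To/From header entries into bare addresses; here is a quick, self-contained illustration of what that call does (the header value is made up):

import email.utils

# parseaddr splits a header entry into a (display name, address) pair
header = "Jane Doe <jane.doe@enron.com>"   # made-up example value
name, addr = email.utils.parseaddr(header)
print(addr.lower())   # -> jane.doe@enron.com (the ETL script also lower-cases addresses)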