Map/Reduce on the Enron dataset
- Austen Wilkerson
We are going to use EMR on the Enron dataset. This dataset contains 1,227,255 emails from Enron employees. The version we use consists of 50 GB of compressed files.

Consider the following scenario: on Sept 9, 2001 (really), The New York Times ran an article titled "MARKET WATCH; A Self-Inflicted Wound Aggravates Angst Over Enron". Someone (your boss?) wants to find out who frequently talked to the press in the days before. You are handed a dump of the email server.

Technically, this task consists of the following steps:

- Put the dataset into S3 (the ugly part, already done for you)
- Extract the date/sender/recipient from the email data (this is what is described in detail below)
- Filter the data to only consider emails in the relevant date range, and to only consider messages going from Enron employees to someone not part of the organization
- Count the number of outside interactions, and only include accounts that have more than one outside contact that week.

To achieve this, you need to create a set of MapReduce jobs. We are going to implement those in Python, using the Hadoop streaming feature also available in EMR. If you are new to Python, check out an introductory tutorial first; an example of using Python and Hadoop Streaming on EMR is linked from the course materials (not all details are relevant here).

In Hadoop Streaming, Python MapReduce programs are given a part of the input data on the standard system input (stdin) and are expected to write tab-separated tables on the standard output (stdout). Here is a working skeleton for a map or reduce function:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip().split('\t')
    # do something with line[0], line[1], etc.
    print("some_key\tsome_payload")
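To see the stdin/stdout contract concretely, here is a small stand-alone sketch of the skeleton above; the two input records ("alice", "bob") are made up, and an io.StringIO stands in for sys.stdin so it can run without a cluster:

```python
import io

# made-up sample input: two tab-separated records, exactly as Hadoop
# Streaming would present them on stdin (StringIO stands in for sys.stdin)
sample = io.StringIO("alice\t1\nbob\t2\n")

records = []
for line in sample:
    fields = line.strip().split('\t')     # fields[0] = key, fields[1] = payload
    records.append((fields[0], fields[1]))
    print(fields[0] + "\t" + fields[1])   # what a real mapper would emit
```

On a cluster the only difference is that the lines arrive on sys.stdin and the printed pairs are collected by Hadoop.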
The reducer counterpart starts very similarly, but has one important difference: all the values with the same key from the mapper will follow each other, which allows them to be combined.

First, let's start a (small) cluster. Log into AWS. This time, the simple configuration is fine, and most of the defaults can stay the way they are. Some version numbers might be higher/newer than in the screenshot(s) below, but that should be fine. For the hardware configuration, we are going to start with a 2-node cluster of m1.large instances.

Wait until your cluster has started ("Waiting"). While you are waiting, we need to create an S3 bucket for the output of our analysis.
Click "Create Bucket". Select the "US Standard" region and a name for your bucket. This name needs to be globally unique across S3. Then click "Create".

Back to EMR: first, we are going to run the data transformation on a small part of the dataset. On your (now hopefully soon ready) cluster page, select "Steps", then "Add Step".
Select Step Type "Streaming Program" and give it a name. Further, set:

- Mapper to s3://enron-scripts/enron-etl.py
- Reducer to cat
- Input S3 Location to s3://enron-scripts/enron-urls-small.txt
- Output S3 Location to s3://enron-results/t1 (replace enron-results with the S3 bucket name you just created)

Then click "Add".
You will see the MapReduce job starting, going from "Pending" to "Running" and then hopefully to "Completed". If something goes wrong, inspect the log files!

If all went well, it is time to inspect the results in S3. Right-click the part file and download it to inspect its contents. You will find three columns, separated by a tab (\t) character, containing (1) a timestamp, (2) a sender address, and (3) a recipient address, respectively. In other words, the enron-etl.py you just ran extracted from the raw data exactly the information required for the analysis described above, i.e., for your task:

T02:58:00Z	bsitz@mail.utexas.edu	steven.p.south@enron.com
T02:58:00Z	bsitz@mail.utexas.edu	steven.p.south@enron.com
T02:58:00Z	bsitz@mail.utexas.edu	steven.p.south@enron.com
T16:11:00Z	enerfax1@bellsouth.net	enerfaxweb@yahoogroups.com
T16:11:00Z	enerfax1@bellsouth.net	enerfaxweb@yahoogroups.com
T03:38:00Z	linda.w.humphrey@williams.com	kbbaker@ppco.com
T03:38:00Z	linda.w.humphrey@williams.com	kevin.coyle@cmenergy.com
T03:38:00Z	linda.w.humphrey@williams.com	kims@kimballenergy.com
T03:38:00Z	linda.w.humphrey@williams.com	kmorrow@bcgas.com
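Each of these rows splits cleanly on the tab character, which is all your own mapper needs. Here is a hedged sketch of pulling a row apart and applying the Enron-to-outside filter from the task; the row itself (timestamp, names, addresses) is made up for illustration:

```python
# a made-up row in the ETL output format: timestamp \t sender \t recipient
row = "2001-09-05T02:58:00Z\tjane.doe@enron.com\treporter@nytimes.com"

timestamp, sender, recipient = row.strip().split('\t')

# the filtering idea from the task: keep only mail leaving the organization,
# i.e., an Enron sender and a non-Enron recipient
is_outgoing = sender.endswith('@enron.com') and not recipient.endswith('@enron.com')
print(is_outgoing)
```

Your real mapper would additionally check the timestamp against the date range and emit, e.g., the sender as key.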
In fact, AWS/EMR/Hadoop might choose to use more than one reducer (check "View jobs" and "View tasks" as described below for details), and then the result will be distributed over more than one file. In my latest test, AWS/EMR/Hadoop used two reducers for my job, resulting in two files, i.e., part-00000 & part-00001. This file (or these files) will be the input for your next MapReduce job as described above.

(Tip: if you specify as Input S3 Location not a file (such as s3://enron-scripts/enron-urls-small.txt in the above example) but a directory (folder), e.g., the s3://enron-results/t1 result folder you used above, AWS/EMR/Hadoop will automatically iterate over all files in the directory (folder), i.e., you do not need to concatenate them yourself in any way.)

Create a mapper.py and a reducer.py script, upload them to your S3 bucket, point to them in the "Streaming Program" step creation, and run them. See the skeleton further up for an example. The mapper is expected to output a key and values separated by a tab (\t) character. As mentioned in the slides, the mapper typically filters records and outputs them with a common key, and the reducers read the records grouped by that common key and output an aggregation.

Here are examples of Hadoop Streaming mappers and reducers doing word count (the plain text files are also available for download):

Mapper:

#!/usr/bin/env python
import sys

# iterate line-wise over standard input (stdin)
for line in sys.stdin:
    # split line (after stripping off any leading/trailing whitespace)
    # on whitespace into "words"
    words = line.strip().split()
    # iterate over all words of the line
    for word in words:
        # print the word (after stripping off any leading/trailing
        # whitespace) as key and the number "1" as value,
        # as a tab-('\t')-separated (key, value) pair
        print(word.strip() + "\t1")

Reducer:

#!/usr/bin/env python
import sys

# initialize variables
current_count = 0
current_word = ""
# iterate line-wise over standard input (stdin)
# (recall, each line is expected to consist of a tab-('\t')-separated
# (key, value) pair)
for line in sys.stdin:
    # split line (after stripping off any leading/trailing whitespace)
    # on tab ('\t') into key & value
    line = line.strip().split('\t')
    # sanity check: did we indeed get exactly two parts (key & value)?
    # if not, skip this line and continue with the next one
    if len(line) != 2:
        continue
    # extract key
    key = line[0]
    # new (next) key
    # (recall, keys are expected to arrive in sorted order)
    if key != current_word:
        if current_count > 0:
            # print the previous key and its aggregated count
            # as a tab-('\t')-separated (key, value) pair
            print(current_word + '\t' + str(current_count))
        # reset the counter to 0 and remember the new key
        current_count = 0
        current_word = key
    # increment count by 1
    current_count += 1

if current_count > 0:
    # print the last key and its aggregated count
    # as a tab-('\t')-separated (key, value) pair
    print(current_word + '\t' + str(current_count))

If anything goes wrong (which is likely in the beginning), you should inspect the log files provided for the EMR step. It can take a few minutes for them to appear in the web interface. Also check the logs of failing tasks! Finally, make sure each job's output directory does not yet exist in S3, otherwise the job will fail.

For local (i.e., on your laptop) prototyping of your Map() and Reduce() scripts, follow the instructions on the course website, replacing kjv.txt with the part file(s) created and downloaded above.

Larger Dataset

To run the ETL (and your subsequent job) on the larger dataset, create a step as follows: select Step Type "Custom JAR" and give it a name.
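You can also sanity-check the map-sort-reduce dataflow entirely inside Python before touching the cluster. This sketch inlines the word-count logic of the two scripts above as generator functions and lets sorted() play the role of Hadoop's shuffle/sort phase; the sample text is made up:

```python
import io

def map_words(stream):
    # mapper: emit one "word\t1" line per word
    for line in stream:
        for word in line.strip().split():
            yield word.strip() + "\t1"

def reduce_counts(lines):
    # reducer: lines arrive sorted by key; aggregate runs of equal keys
    current_word, current_count = "", 0
    for line in lines:
        parts = line.strip().split('\t')
        if len(parts) != 2:
            continue
        key = parts[0]
        if key != current_word:
            if current_count > 0:
                yield current_word + "\t" + str(current_count)
            current_word, current_count = key, 0
        current_count += 1
    if current_count > 0:
        yield current_word + "\t" + str(current_count)

sample = io.StringIO("the cat sat\nthe cat\n")
# sorted() simulates Hadoop's shuffle/sort between mapper and reducer
for out in reduce_counts(sorted(map_words(sample))):
    print(out)
```

This mirrors the shell pipeline `cat input | mapper.py | sort | reducer.py` used for local prototyping.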
Set JAR location to command-runner.jar and set Arguments to:

hadoop-streaming -Dmapred.map.tasks=100 -files s3://enron-scripts/enron-etl.py -mapper enron-etl.py -reducer cat -input s3://enron-scripts/enron-urls.txt -output s3://enron-results/f1

(Replace enron-results with the S3 bucket name you just created.)

Note: this is a normal Hadoop streaming job, too, but for complicated reasons we need to set a custom MapReduce parameter.

NOTE: This is essentially the same ETL (extract, transform, load) job as above, but now for the large/entire dataset rather than only a small subset. Thus, it generates the same three-column, tab-(\t)-separated result containing (1) a timestamp, (2) a sender address, and (3) a recipient address, respectively, but now in s3://.../f1/ rather than s3://.../t1/. Hence, for the assignment, you need to run your own Map() & Reduce() jobs on the result of this large ETL job, just as you ran the word-count example above on the result of the small ETL job, but now using s3://.../f1/ as input rather than s3://.../t1/.

Then click "Add". After the step has started, inspect its mapper tasks:
Scroll down to inspect the large number of mapper tasks. In its current state, your cluster will take a long time to finish all of those. But since this is the cloud, we can simply request more firepower: on your cluster details page, select "Resize".
Increase the "Core Instance Group" to a count of 5. Once the additional nodes are available, the step will process much faster. After it has completed, run your MapReduce job on the larger results.
Once finished, again make sure to shut down your EMR cluster!

ETL script for reference (also available as a plain text file):

#!/usr/bin/env python
# this turns the enron archive into tuples (date, from, to)
import sys
import zipfile
import tempfile
import email
import email.utils
import time
import datetime
import os
import urllib

# stdin is a list of URLs to data files
for u in sys.stdin:
    u = u.strip()
    if not u:
        continue
    tmpf = tempfile.mkstemp()
    urllib.urlretrieve(u, tmpf[1])
    try:
        zip = zipfile.ZipFile(tmpf[1], 'r')
    except:
        continue
    txtf = [i for i in zip.infolist() if i.filename.endswith('.txt')]
    for f in txtf:
        msg = email.message_from_file(zip.open(f))
        tostr = msg.get("to")
        fromstr = msg.get("from")
        datestr = msg.get("date")
        if tostr is None or fromstr is None or datestr is None:
            continue
        toaddrs = [email.utils.parseaddr(a) for a in tostr.split(',')]
        fromaddr = email.utils.parseaddr(fromstr)[1].replace('\'', '').strip().lower()
        try:
            # datetime hell, convert custom time zone stuff to UTC
            dt = datetime.datetime.strptime(datestr[:25].strip(),
                                            '%a, %d %b %Y %H:%M:%S')
            dt = dt + datetime.timedelta(hours=int(datestr[25:].strip()[:3]))
        except ValueError:
            continue
        if not '@' in fromaddr or '/' in fromaddr:
            continue
        for a in toaddrs:
            if not '@' in a[1] or '/' in a[1]:
                continue
            ta = a[1].replace('\'', '').strip().lower()
            print dt.isoformat() + 'Z\t' + fromaddr + '\t' + ta
    zip.close()
    os.remove(tmpf[1])
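The trickiest part of the script is the hand-rolled time-zone handling: the first 25 characters of the Date: header are parsed with strptime, and the signed hour offset is then added via timedelta. This stand-alone sketch reproduces those two steps (the Date: header value is made up); it mirrors the script's arithmetic rather than validating it:

```python
import datetime

# a made-up RFC-2822-style Date: header as found in the raw mail
datestr = "Tue, 11 Sep 2001 08:30:00 -0700 (PDT)"

# the first 25 characters hold the date/time, the remainder the offset
dt = datetime.datetime.strptime(datestr[:25].strip(),
                                '%a, %d %b %Y %H:%M:%S')
offset_hours = int(datestr[25:].strip()[:3])   # "-07" -> -7
dt = dt + datetime.timedelta(hours=offset_hours)

print(dt.isoformat() + 'Z')
```

Note that this ignores the minutes part of the offset, just like the reference script.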
NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model
More informationmrjob Documentation Release dev0 Steve Johnson
mrjob Documentation Release 0.6.3.dev0 Steve Johnson March 30, 2018 Contents 1 Guides 3 1.1 Why mrjob?............................................... 3 1.2 Fundamentals...............................................
More informationIntegrating Beamr Video Into a Video Encoding Workflow By: Jan Ozer
Integrating Beamr Video Into a Video Encoding Workflow By: Jan Ozer Beamr Video is a perceptual video optimizer that significantly reduces the bitrate of video streams without compromising quality, enabling
More informationChronix A fast and efficient time series storage based on Apache Solr. Caution: Contains technical content.
Chronix A fast and efficient time series storage based on Apache Solr Caution: Contains technical content. 68.000.000.000* time correlated data objects. How to store such amount of data on your laptop
More informationCS158 - Assignment 9 Faster Naive Bayes? Say it ain t so...
CS158 - Assignment 9 Faster Naive Bayes? Say it ain t so... Part 1 due: Sunday, Nov. 13 by 11:59pm Part 2 due: Sunday, Nov. 20 by 11:59pm http://www.hadoopwizard.com/what-is-hadoop-a-light-hearted-view/
More informationCom S 227 Assignment Submission HOWTO
Com S 227 Assignment Submission HOWTO This document provides detailed instructions on: 1. How to submit an assignment via Canvas and check it 3. How to examine the contents of a zip file 3. How to create
More informationHadoop and Map-reduce computing
Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.
More informationIllustrated Guide to the. UTeach. Electronic Portfolio
Illustrated Guide to the UTeach Electronic Portfolio UTeach Portfolio Guide 1 Revised Spring 2011 The Electronic Portfolio All UTeach students have access to the electronic portfolio. If you can t log
More informationLambda Architecture for Batch and Stream Processing. October 2018
Lambda Architecture for Batch and Stream Processing October 2018 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only.
More informationActual4Dumps. Provide you with the latest actual exam dumps, and help you succeed
Actual4Dumps http://www.actual4dumps.com Provide you with the latest actual exam dumps, and help you succeed Exam : HDPCD Title : Hortonworks Data Platform Certified Developer Vendor : Hortonworks Version
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More information61A Lecture 36. Wednesday, November 30
61A Lecture 36 Wednesday, November 30 Project 4 Contest Gallery Prizes will be awarded for the winning entry in each of the following categories. Featherweight. At most 128 words of Logo, not including
More informationLecture Transcript While and Do While Statements in C++
Lecture Transcript While and Do While Statements in C++ Hello and welcome back. In this lecture we are going to look at the while and do...while iteration statements in C++. Here is a quick recap of some
More informationCSCI0931 Intro Comp for Humanities & Soc Sci Jun Ki Lee. Final Project. Rubric
CSCI0931 Intro Comp for Humanities & Soc Sci Jun Ki Lee Final Project Rubric Note: The Final Project is 30% of the total grade for this course. Name: Category Proposal 10 Meetings with Staff 5 Design Elements
More informationCC PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018
CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018 Lecture 4: Apache Pig Aidan Hogan aidhog@gmail.com HADOOP: WRAPPING UP 0. Reading/Writing to HDFS Creates a file system for default configuration Check
More informationFor Volunteers An Elvanto Guide
For Volunteers An Elvanto Guide www.elvanto.com Volunteers are what keep churches running! This guide is for volunteers who use Elvanto. If you re in charge of volunteers, why not check out our Volunteer
More information