Virginia Tech, Computer Science
CS 5614: (Big) Data Management Systems, Spring 2017, Prakash

Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams
(due March 16th, 9:30am, in class, hard-copy please)

Reminders:
a. Out of 100 points. Contains 4 pages.
b. Rough time-estimate: 7-12 hours.
c. Please type your answers. Illegible handwriting may get no points, at the discretion of the grader. Only drawings may be hand-drawn, as long as they are neat and legible.
d. There could be more than one correct answer. We shall accept them all.
e. Whenever you are making an assumption, please state it clearly.
f. Each HW has to be done individually, without taking any help from non-class resources (e.g. websites etc.).

Q1. Map-Reduce [45 points]

In this question, we will use MapReduce to figure out the number of 2-grams in a large text corpus, given all the distinct 4-grams from that corpus. The idea is to convince you that using Hadoop on AWS has now really become a low-enough cost/effort proposition (compared to setting up your own cluster). You can use one of Java/Python/Ruby to implement this question. You are free to use Hadoop Streaming as well if you want.

Familiarize yourself with AWS (Amazon Web Services). Read the setup guidelines posted on Piazza to set up your AWS account and redeem your free credit ($40). Do the setup early!

Link to setup:
http://people.cs.vt.edu/~badityap/classes/cs5614-spr17/homeworks/hw3/aws-setup.pdf

Link to how to run a sample wordcount application on AWS:
http://people.cs.vt.edu/~badityap/classes/cs5614-spr17/homeworks/hw3/sbs-aws-wordcount.pdf

The pricing for the various services provided by AWS can be found at http://aws.amazon.com/pricing/. The services we will primarily be using for this assignment are Amazon S3 storage, Amazon Elastic Compute Cloud (EC2) virtual servers in the cloud, and the Amazon Elastic MapReduce (EMR) managed Hadoop framework. Play around with AWS and try to create MapReduce job flows (not required, or graded), or try the sample job flows on AWS. Of course, after you are done with the HW, feel free to use your remaining credits for any other fun computations/applications you may have in mind! These credits are applicable to AWS as a whole, not just MapReduce.

The questions in this assignment should ideally use up only a very small fraction of your credit. AWS allows you to use up to 20 instances in total (that means 1 master instance and up to 19 core instances) without filling out a limit-request form. For this assignment, you should not exceed this quota of 20 instances. You can learn about the instance types by going through the extensive AWS documentation.

Very important: always check how much credit you have left from time to time (sometimes this takes a while to update), and always test your mappers and reducers on some sample local data before using AWS. Note that if you run over your credit, your card will be charged! To check how much credit you have spent, go to the Billing and Cost Management Dashboard from the AWS management console; the link is in the dropdown under your name in the top-right corner of the console. On that page, you can check the month-to-date credit you have spent. You should also check the Bills page (sometimes it is updated more frequently than the credit page); the link to it is in the left column of the Billing and Cost Management Dashboard page.

We will use data from the Google Books n-gram viewer corpus. N-grams are fixed-size tuples of items; in this case, the items are words extracted from the Google Books corpus. As we saw in class, n specifies the number of elements in the tuple, so a 5-gram contains five words. This dataset is freely available on Amazon S3 in a Hadoop-friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. The original dataset is available from http://books.google.com/ngrams/. The subset we will be using for this assignment is a subset of the 4-gram English 1M dataset, in the following S3 bucket (directory), which is accessible to all:

s3://cs5614-2017-vt-cs-data/eng-1m/

The data is in a simple txt file, and each row of the dataset is formatted like:

ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE

(note that the file is TAB-delimited). For example, two sample lines in our dataset could be:

analysis is often described    1991    10    1    1
analysis is often described    1992    30    2    1

where "analysis is often described" is a 4-gram, and the lines tell us that it occurred 10 times in the year 1991 in 1 book in the Google Books sample, 30 times in the year 1992, and so on.

Refer to our setup guidelines to see how to set this data as input to your MapReduce job (Section 6 in the guidelines). We have provided a screenshot for configuring the EMR cluster, which demonstrates how to access input data from a given bucket (here, our bucket is the one given above).

Q1.1. (10 points) What is the total number of distinct 4-grams in the dataset? Write a simple MapReduce job to compute this.
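
For concreteness, here is one possible Hadoop Streaming sketch in Python (a minimal sketch, not the required solution; any of the allowed languages works). The mapper emits each 4-gram as a key; the reducer assumes it runs as a single reduce task (e.g. with -D mapreduce.job.reduces=1), so all keys arrive sorted in one place and the distinct 4-grams can be counted by watching the key change.

mapper.py:

    #!/usr/bin/env python
    # Q1.1 mapper sketch: emit each 4-gram once per input row.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(fields[0] + "\t1")

reducer.py:

    #!/usr/bin/env python
    # Q1.1 reducer sketch (single reducer): keys arrive sorted, so the
    # number of distinct 4-grams is the number of key changes.
    import sys

    distinct = 0
    prev = None
    for line in sys.stdin:
        key = line.split("\t", 1)[0]
        if key != prev:
            distinct += 1
            prev = key
    print(distinct)

As the reminders above suggest, test locally before spending credit, e.g.: cat sample.txt | ./mapper.py | sort | ./reducer.py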

Q1.2. (5 points) Plot the frequency distribution of the occurrence counts of the 4-grams, i.e. a plot where the x-axis is the occurrence count (say k) and the y-axis is the number of 4-grams which occur k times. Show the plot in both linear-linear and log-log scales. Just paste the two figures as the answer. Hint: it will be easiest if you write a simple MR job to pull out just the occurrence information from the dataset, and then compute the distribution locally on your machine (a local sketch appears after the deliverables list below).

Q1.3. (25 points) Write a MapReduce job to compute the total number of distinct 2-grams using the same dataset (a sketch of the mapper side also appears after the deliverables list below).

Q1.4. (5 points) Write down the total number of 2-grams you get (you may need another separate MR job for this; no need to show us the code for it, just write down the number).

Note: in order to avoid extra charges, after you are done with your homework do not forget to remove all your files in S3 buckets, i.e. the ones generated as output of the 4-gram and 2-gram counts, and any files you have uploaded.

Code Deliverables:
For Q1.1: give the mapper and reducer files (in addition to the number).
For Q1.3: give the mapper and reducer for computing the 2-grams from the dataset.
Zip all of these as YOUR-LASTNAME.zip and send it to Xiaolong (email: xlwu@vt.edu) with the subject HW3-Code-Q1. Also copy-paste these in your hard copy.
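
Following the hint in Q1.2, a minimal local sketch: it assumes a prior MR job has already written one total occurrence count per 4-gram, one integer per line, to a local file (counts.txt is a hypothetical name). The resulting distribution file can then be plotted with any tool you like.

    # Q1.2 sketch: distribution of occurrence counts, computed locally.
    from collections import Counter

    with open("counts.txt") as fh:                # hypothetical input file
        dist = Counter(int(line) for line in fh if line.strip())

    with open("distribution.txt", "w") as out:
        for k in sorted(dist):                    # x-axis: occurrence count k
            out.write("%d\t%d\n" % (k, dist[k]))  # y-axis: #4-grams occurring k times

And for Q1.3, a sketch of the mapper side only, under the assumption that a 2-gram means a consecutive word pair, so the 4-gram "w1 w2 w3 w4" contributes (w1, w2), (w2, w3) and (w3, w4); the distinct-key-counting reducer from the Q1.1 sketch can be reused unchanged.

    #!/usr/bin/env python
    # Q1.3 mapper sketch: emit every consecutive 2-gram inside each 4-gram.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        words = fields[0].split()
        for i in range(len(words) - 1):
            print(words[i] + " " + words[i + 1] + "\t1")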

Q2. Finding Similar Items [30 points]

Q2.1. (15 points) In class, we saw how to construct signature matrices using random permutations. Unfortunately, permuting the rows of a large matrix is prohibitive. Section 3.3.5 of your textbook (in Chapter 3) gives a fairly simple method to simulate this randomness using different hash functions. Please read through that section before attempting this question. Now consider the matrix below.

[The characteristic matrix from the original handout is not reproduced in this transcription.]

a. (9 points) Compute the minhash signature for each column using the method given in Sec. 3.3.5 of your textbook, with the following three hash functions: (A) h1(x) = (2x + 1) mod 6; (B) h2(x) = (3x + 2) mod 6; and (C) h3(x) = (5x + 2) mod 6. So you will finally get a 3x4 matrix as the signature matrix. Just show the initially computed hash-function values and the final signature matrix.

b. (6 points) How close are the estimated Jaccard similarities for the six pairs of columns to the true Jaccard similarities (i.e. give the ratio estimated/true for each of the pairs)?
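
If you want to sanity-check part (a) by machine, here is a small sketch of the one-pass method of Sec. 3.3.5. The ROWS matrix below is a hypothetical stand-in, since the handout's characteristic matrix is not reproduced in this transcription; substitute the real rows before trusting the output.

    # Minhash signatures via hash functions (Sec. 3.3.5), a sketch.
    # ROWS[r][c] == 1 means row r belongs to set (column) c.
    ROWS = [            # hypothetical 6x4 characteristic matrix
        [1, 0, 0, 1],
        [0, 0, 1, 0],
        [0, 1, 0, 1],
        [1, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 0, 0],
    ]
    HASHES = [
        lambda x: (2 * x + 1) % 6,
        lambda x: (3 * x + 2) % 6,
        lambda x: (5 * x + 2) % 6,
    ]

    INF = float("inf")
    sig = [[INF] * len(ROWS[0]) for _ in HASHES]   # 3x4 signature matrix
    for r, row in enumerate(ROWS):
        hvals = [h(r) for h in HASHES]             # hash values for row r
        for c, bit in enumerate(row):
            if bit:                                # row r is in set c
                for i, hv in enumerate(hvals):
                    sig[i][c] = min(sig[i][c], hv)

    for i, sig_row in enumerate(sig):
        print("h%d: %s" % (i + 1, sig_row))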

Q2.2. (15 points) Recall that in LSH, given b bands of r rows each, the probability that a pair of documents with Jaccard similarity s becomes a candidate pair is given by the function f(s) = 1 - (1 - s^r)^b.

a. (6 points) Show that s* = (1/b)^(1/r) is a good approximation to the value of s at which the slope of f(s) is maximum. Hint: feel free to use Mathematica if you want; it will save you some time. :)

b. (7 points) Recall that the threshold is the value of s at which the probability of becoming a candidate pair is 1/2. Given b = 32 and r = 8, plot a graph of f(s) vs. s (of course, s should vary from 0 to 1) and numerically estimate the value of s at which f(s) is 1/2 (call this value s1). Show the plot here, demonstrating your computation.

c. (2 points) How does s1 compare to s* (for the same values b = 32 and r = 8)?
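
A minimal sketch for parts (b) and (c), assuming matplotlib is available (any plotting tool works just as well): evaluate f(s) on a grid, save the plot, and bisect for f(s) = 1/2, using the fact that f is increasing on [0, 1].

    # Q2.2(b) sketch: plot f(s) = 1 - (1 - s^r)^b and locate f(s) = 1/2.
    import matplotlib
    matplotlib.use("Agg")                 # render to a file, no display needed
    import matplotlib.pyplot as plt

    b, r = 32, 8
    f = lambda s: 1.0 - (1.0 - s ** r) ** b

    xs = [i / 1000.0 for i in range(1001)]
    plt.plot(xs, [f(s) for s in xs])
    plt.xlabel("s")
    plt.ylabel("f(s)")
    plt.savefig("lsh_fs.png")

    lo, hi = 0.0, 1.0                     # bisection: f is increasing on [0, 1]
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if f(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    print("s1 =", (lo + hi) / 2.0)
    print("s* =", (1.0 / b) ** (1.0 / r))  # approximation from part (a)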

Q3. Stream Mining [20 points]

Q3.1. (10 points) Bloom filters: suppose we have n bits of memory available, and our set S has m members. Instead of using k hash functions into a single array, we could divide the n bits into k arrays and hash once into each array. As a function of n, m, and k, what is the probability of a false positive? How does it compare with using k hash functions into a single array?

Q3.2. (5 points) AMS algorithm: compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2. What is the third moment of this stream?

Q3.3. (5 points) DGIM algorithm: suppose the window is as shown below (the most recent bit is on the right).

[The bit-window figure from the original handout is not reproduced in this transcription.]

Estimate the number of 1s in the last k positions, for k = (a) 5 and (b) 15. In each case, how far from the correct value is your estimate?

Q4. Frequent Itemsets [5 points]

Let there be I items in a market-basket data set of B baskets. Suppose that every basket contains exactly K items. As a function of I, B, and K, how much space does the triangular-matrix method take to store the counts of all pairs of items, assuming four bytes per array element?
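
For reference while answering Q4, a small sketch of the triangular-matrix layout itself, with a hypothetical item count n; which of I, B, and K the total space actually depends on is the question, so the sketch stops at the layout.

    # Triangular-matrix layout sketch: counts for all unordered pairs {i, j}
    # of n items, packed into a 1-D array of n*(n-1)/2 four-byte entries.

    def pair_index(i, j, n):
        """1-based slot of pair (i, j), 1 <= i < j <= n, in the packed array;
        this is the textbook's k = (i-1)(n - i/2) + j - i in integer form."""
        assert 1 <= i < j <= n
        return (i - 1) * n - i * (i - 1) // 2 + (j - i)

    n = 5  # hypothetical, for illustration only
    slots = sorted(pair_index(i, j, n)
                   for i in range(1, n) for j in range(i + 1, n + 1))
    print(slots)                    # 1..10: each pair gets a unique slot
    print(4 * n * (n - 1) // 2)     # bytes for n items at 4 bytes per entry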