Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16th, 9:30am, in class, hard-copy please)

Virginia Tech. Computer Science
CS 5614 (Big) Data Management Systems
Spring 2017, Prakash

Reminders:
a. Out of 100 points. Contains 4 pages.
b. Rough time estimate: 7-12 hours.
c. Please type your answers. Illegible handwriting may get no points, at the discretion of the grader. Only drawings may be hand-drawn, as long as they are neat and legible.
d. There could be more than one correct answer. We shall accept them all.
e. Whenever you make an assumption, please state it clearly.
f. Each HW has to be done individually, without taking any help from non-class resources (e.g. websites).

Q1. Map-Reduce [45 points]

In this question, we will use MapReduce to figure out the number of 2-grams in a large text corpus, given all the distinct 4-grams from that corpus. The idea is to convince you that using Hadoop on AWS has now really become a low-enough cost/effort proposition (compared to setting up your own cluster). You can use one of Java/Python/Ruby to implement this question. You are free to use Hadoop Streaming as well if you want.

Familiarize yourself with AWS (Amazon Web Services). Read the set-up guidelines posted on Piazza to set up your AWS account and redeem your free credit ($40). Do the setup early!

Link to setup:
Link to how to run a sample wordcount application on AWS:
The pricing for various services provided by AWS can be found at

The services we will primarily use for this assignment are the Amazon S3 storage, the Amazon Elastic Compute Cloud (EC2) virtual servers in the cloud, and the Amazon Elastic MapReduce (EMR) managed Hadoop framework. Play around with AWS and try to create MapReduce job flows (not required, or graded) or try the sample job flows on AWS. Of course, after you are done with the HW, feel free to use your remaining credits for any other fun computations/applications you may have in mind!
These credits are applicable more generally for AWS as a whole, not just MapReduce. The questions in this assignment will ideally use up only a very small fraction of your $35 credit. AWS allows you to use up to 20 instances total (that means 1 master instance and up to 19 core instances) without filling out a limit request form. For this assignment, you should not exceed this quota of 20 instances. You can learn about these instance types by going through the extensive AWS documentation.

Very Important: You should always check how much credit you have left from time to time. Sometimes this takes a while to update. Always make sure to test your mappers and reducers on some sample local data before using AWS. Note that if you run over, your card will be charged! Ideally, this assignment should need only a small fraction of your credits.

To check how much credit you have spent, go to the Billing and Cost Management Dashboard from the AWS management console. The link to this page is in the dropdown under your name in the top right corner of the AWS management console. On this page, you can check the month-to-date credit you spent. You should also check the Bills page (sometimes that is updated more frequently than the credit page). The link to this page is in the left column of the Billing and Cost Management Dashboard page. Click on the Bills link.

We will use data from the Google Books n-gram viewer corpus. N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. As we saw in class, the n specifies the number of elements in the tuple, so a 5-gram contains five words. This data set is freely available on Amazon S3 in a Hadoop-friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. The original dataset is available from

The data we will be using for this assignment is a subset of the 4-gram English 1M dataset. It is in the following S3 bucket (directory) and is accessible to all:

s3://cs vt-cs-data/eng-1m/

The data is in a simple txt file, and each row of the dataset is formatted like:

ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE

(note that the file is TAB delimited).
For example, 2 sample lines in our dataset could be:

analysis is often described
analysis is often described

where "analysis is often described" is a 4-gram, and the lines tell us that it occurred 10 times in the year 1991 in 1 book in the Google Books sample, 30 times in the year 1992, and so on.

Refer to our setup guidelines to see how to set this data as input to your MapReduce job (Section 6 in the guidelines). We have provided a screenshot of the EMR cluster configuration, which demonstrates how to access input data from a given bucket (here our bucket is the one given above).

Q1.1. (10 points) What is the total number of distinct 4-grams in the dataset? Write a simple MapReduce job to compute this.
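The TAB-delimited row format described above can be parsed with a few lines of Python, which may help when checking mappers on local sample data before spending AWS credit. The sketch below is illustrative only: the names `NgramRow` and `parse_row` are hypothetical, and the page_count value in the sample row is invented.

```python
from typing import NamedTuple


class NgramRow(NamedTuple):
    """One row of the n-gram data set (hypothetical helper type)."""
    ngram: str
    year: int
    match_count: int
    page_count: int
    volume_count: int


def parse_row(line: str) -> NgramRow:
    """Split one TAB-delimited row into its five typed fields."""
    ngram, year, match_count, page_count, volume_count = (
        line.rstrip("\n").split("\t")
    )
    return NgramRow(
        ngram, int(year), int(match_count), int(page_count), int(volume_count)
    )


# Illustrative row: the page_count value (7) is invented for this example.
sample = "analysis is often described\t1991\t10\t7\t1\n"
row = parse_row(sample)

# The words of the n-gram are separated by single spaces.
print(row.ngram.split(" "), row.year, row.match_count)
```

In a Hadoop Streaming mapper, the same function could be applied to each line read from standard input.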

Q1.2. (5 points) Plot the frequency distribution for the occurrence counts of the 4-grams, i.e. a plot where the x-axis is the occurrence count (say k) and the y-axis is the number of 4-grams which occur k times. Show the plot in both linear-linear and log-log scales. Just paste the two figures as the answer. Hint: It will be easiest if you write a simple MR job to pull out just the occurrence information from the dataset, and then compute the distribution locally on your machine.

Q1.3. (25 points) Write a MapReduce job to compute the total number of distinct 2-grams using the same dataset.

Q1.4. (5 points) Write down the total number of 2-grams you get (you may need another separate MR job for this; there is no need to show us the code for it, just write down the number).

Note: In order to avoid extra charges, after you are done with your homework, do not forget to remove all your files in S3 buckets, i.e. the ones generated as output of the 4-gram and 2-gram counts and any files you have uploaded.

Code Deliverables:
For Q1.1: Give the mapper and reducer files (in addition to the number).
For Q1.3: Give the mapper and reducer for computing the 2-grams from the dataset.
Zip all of these as YOUR-LASTNAME.zip and send it to Xiaolong (xlwu@vt.edu) with the subject HW3-Code-Q1. Also copy-paste these in your hard copy.

Q2. Finding Similar Items [30 points]

Q2.1. (15 points) In class, we saw how to construct signature matrices using random permutations. Unfortunately, permuting rows of a large matrix is prohibitive. Section of your textbook (in Chapter 3) gives a fairly simple method to simulate this randomness using different hash functions. Please read through that section before attempting this question. Now consider the matrix below.

a. (9 points) Compute the minhash signature for each column using the method given in Sec of your textbook, if we use the following three hash functions: (A) h1(x) = (2x + 1) mod 6; (B) h2(x) = (3x + 2) mod 6; and (C) h3(x) = (5x + 2) mod 6. So you will finally get a 3x4 matrix as the signature matrix. Just show the initially computed hash function values and the final signature matrix.
b. (6 points) How close are the estimated Jaccard similarities for the six pairs of columns to the true Jaccard similarities (i.e. give the ratio of the estimated/true for each of the pairs)?

Q2.2. (15 points) Recall that in LSH, given b and r, the probability that a pair of documents having Jaccard similarity s will be a candidate pair is given by the function $f(s) = 1 - (1 - s^r)^b$.
a. (6 points) Show that $s^* = (1/b)^{1/r}$ is a good approximation to the value of s when the slope of f(s) is the maximum. Hint: Feel free to use Mathematica if you want; it will save you some time.
b. (7 points) Recall that the threshold is the value of s at which the probability of becoming a candidate pair is ½. Given b=32, r=8, plot a graph of f(s) vs s (of course s should vary from 0 to 1) and numerically estimate the value of s when f(s) is ½ (call this value s1). Show the plot here, demonstrating your computation.
c. (2 points) How does s1 compare to s* (for the same values of b=32 and r=8)?

Q3. Stream Mining [20 points]

Q3.1. (10 points) Bloom Filters: Suppose we have n bits of memory available, and our set S has m members. Instead of using k hash functions, we could divide the n bits into k arrays, and hash once to each array. As a function of n, m, and k, what is the probability of a false positive? How does it compare with using k hash functions into a single array?

Q3.2. (5 points) AMS algorithm: Compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2. What is the third moment of this stream?

Q3.3. (5 points) DGIM algorithm: Suppose the window is as shown below.
(Most recent bit is on the right)

Estimate the number of 1s in the last k positions, for k = (a) 5 (b) 15. In each case, how far off the correct value is your estimate?

Q4. Frequent Itemsets [5 points]

Let there be I items in a market-basket data set of B baskets. Suppose that every basket contains exactly K items. As a function of I, B, and K, how much space does the triangular-matrix method take to store the counts of all pairs of items, assuming four bytes per array element?
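As a supplementary illustration for Q2.2, the candidate-pair probability under banding with b bands of r rows each, f(s) = 1 - (1 - s^r)^b, can be tabulated numerically. The sketch below is a minimal example under stated assumptions: the function names `candidate_probability` and `approx_threshold` are hypothetical, and the parameters b = 20, r = 5 are chosen only for illustration (they are not the values asked for in the question).

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that a pair with Jaccard similarity s becomes a
    candidate pair with b bands of r rows each: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** r) ** b


def approx_threshold(r: int, b: int) -> float:
    """Rule-of-thumb location of the steepest rise: (1/b)^(1/r)."""
    return (1.0 / b) ** (1.0 / r)


# Illustrative parameters only (assumed for this sketch): b = 20, r = 5.
R, B = 5, 20
for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s = {s:.1f}  f(s) = {candidate_probability(s, R, B):.4f}")
print(f"approximate threshold: {approx_threshold(R, B):.3f}")
```

Evaluating the same function on a fine grid of s values gives the points needed for an S-curve plot.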

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please) Virginia Tech. Computer Science CS 5614 (Big) Data Management Systems Fall 2014, Prakash Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in

More information

Amazon Web Services (AWS) Setup Guidelines

Amazon Web Services (AWS) Setup Guidelines Amazon Web Services (AWS) Setup Guidelines For CSE6242 HW3, updated version of the guidelines by Diana Maclean [Estimated time needed: 1 hour] Note that important steps are highlighted in yellow. What

More information

AWS Setup Guidelines

AWS Setup Guidelines AWS Setup Guidelines For CSE6242 HW3, updated version of the guidelines by Diana Maclean Important steps are highlighted in yellow. What we will accomplish? This guideline helps you get set up with the

More information

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Goals Many Web-mining problems can be expressed as finding similar sets:. Pages with similar words, e.g., for classification

More information

Homework 1: RA, SQL and B+-Trees (due September 24 th, 2014, 2:30pm, in class hard-copy please)

Homework 1: RA, SQL and B+-Trees (due September 24 th, 2014, 2:30pm, in class hard-copy please) Virginia Tech. Computer Science CS 5614 (Big) Data Management Systems Fall 2014, Prakash Homework 1: RA, SQL and B+-Trees (due September 24 th, 2014, 2:30pm, in class hard-copy please) Reminders: a. Out

More information

Homework 1: RA, SQL and B+-Trees (due Feb 7, 2017, 9:30am, in class hard-copy please)

Homework 1: RA, SQL and B+-Trees (due Feb 7, 2017, 9:30am, in class hard-copy please) Virginia Tech. Computer Science CS 5614 (Big) Data Management Systems Spring 2017, Prakash Homework 1: RA, SQL and B+-Trees (due Feb 7, 2017, 9:30am, in class hard-copy please) Reminders: a. Out of 100

More information

Homework 1: Relational Algebra and SQL (due February 10 th, 2016, 4:00pm, in class hard-copy please)

Homework 1: Relational Algebra and SQL (due February 10 th, 2016, 4:00pm, in class hard-copy please) Virginia Tech. Computer Science CS 4604 Introduction to DBMS Spring 2016, Prakash Homework 1: Relational Algebra and SQL (due February 10 th, 2016, 4:00pm, in class hard-copy please) Reminders: a. Out

More information

Creating an Inverted Index using Hadoop

Creating an Inverted Index using Hadoop Creating an Inverted Index using Hadoop Redeeming Google Cloud Credits 1. Go to https://goo.gl/gcpedu/zvmhm6 to redeem the $150 Google Cloud Platform Credit. Make sure you use your.edu email. 2. Follow

More information

Homework 6: FDs, NFs and XML (due April 15 th, 2015, 4:00pm, hard-copy in-class please)

Homework 6: FDs, NFs and XML (due April 15 th, 2015, 4:00pm, hard-copy in-class please) Virginia Tech. Computer Science CS 4604 Introduction to DBMS Spring 2015, Prakash Homework 6: FDs, NFs and XML (due April 15 th, 2015, 4:00pm, hard-copy in-class please) Reminders: a. Out of 100 points.

More information

Project Assignment 2 (due April 6 th, 2016, 4:00pm, in class hard-copy please)

Project Assignment 2 (due April 6 th, 2016, 4:00pm, in class hard-copy please) Virginia Tech. Computer Science CS 4604 Introduction to DBMS Spring 2016, Prakash Project Assignment 2 (due April 6 th, 2016, 4:00pm, in class hard-copy please) Reminders: a. Out of 100 points. Contains

More information

Hadoop Exercise to Create an Inverted List

Hadoop Exercise to Create an Inverted List Hadoop Exercise to Create an Inverted List For this project you will be creating an Inverted Index of words occurring in a set of English books. We ll be using a collection of 3,036 English books written

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 53 INTRO. TO DATA MINING Locality Sensitive Hashing (LSH) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan Parthasarathy @OSU MMDS Secs. 3.-3.. Slides

More information

Homework 4: Query Processing, Query Optimization (due March 21 st, 2016, 4:00pm, in class hard-copy please)

Homework 4: Query Processing, Query Optimization (due March 21 st, 2016, 4:00pm, in class hard-copy please) Virginia Tech. Computer Science CS 4604 Introduction to DBMS Spring 2016, Prakash Homework 4: Query Processing, Query Optimization (due March 21 st, 2016, 4:00pm, in class hard-copy please) Reminders:

More information

Homework 2: Query Processing/Optimization, Transactions/Recovery (due February 16th, 2017, 9:30am, in class hard-copy please)

Homework 2: Query Processing/Optimization, Transactions/Recovery (due February 16th, 2017, 9:30am, in class hard-copy please) Virginia Tech. Computer Science CS 5614 (Big) Data Management Systems Spring 2017, Prakash Homework 2: Query Processing/Optimization, Transactions/Recovery (due February 16th, 2017, 9:30am, in class hard-copy

More information

Homework 5: Miscellanea (due April 26 th, 2013, 9:05am, in class hard-copy please)

Homework 5: Miscellanea (due April 26 th, 2013, 9:05am, in class hard-copy please) Virginia Tech. Computer Science CS 4604 Introduction to DBMS Spring 2013, Prakash Homework 5: Miscellanea (due April 26 th, 2013, 9:05am, in class hard-copy please) Reminders: a. Out of 100 points. b.

More information

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014 CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions

More information

Homework 2: E/R Models and More SQL (due February 17 th, 2016, 4:00pm, in class hard-copy please)

Homework 2: E/R Models and More SQL (due February 17 th, 2016, 4:00pm, in class hard-copy please) Virginia Tech. Computer Science CS 4604 Introduction to DBMS Spring 2016, Prakash Homework 2: E/R Models and More SQL (due February 17 th, 2016, 4:00pm, in class hard-copy please) Reminders: a. Out of

More information

Shingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University

Shingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University Shingling Minhashing Locality-Sensitive Hashing Jeffrey D. Ullman Stanford University 2 Wednesday, January 13 Computer Forum Career Fair 11am - 4pm Lawn between the Gates and Packard Buildings Policy for

More information

Homework 6: FDs, NFs and XML (due April 13 th, 2016, 4:00pm, hard-copy in-class please)

Homework 6: FDs, NFs and XML (due April 13 th, 2016, 4:00pm, hard-copy in-class please) Virginia Tech. Computer Science CS 4604 Introduction to DBMS Spring 2016, Prakash Homework 6: FDs, NFs and XML (due April 13 th, 2016, 4:00pm, hard-copy in-class please) Reminders: a. Out of 100 points.

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mining Massive Datasets Jure Leskovec, Stanford University http://cs46.stanford.edu /7/ Jure Leskovec, Stanford C46: Mining Massive Datasets Many real-world problems Web Search and Text Mining Billions

More information

Topic: Duplicate Detection and Similarity Computing

Topic: Duplicate Detection and Similarity Computing Table of Content Topic: Duplicate Detection and Similarity Computing Motivation Shingling for duplicate comparison Minhashing LSH UCSB 290N, 2013 Tao Yang Some of slides are from text book [CMS] and Rajaraman/Ullman

More information

CSE547: Machine Learning for Big Data Spring Problem Set 1. Please read the homework submission policies.

CSE547: Machine Learning for Big Data Spring Problem Set 1. Please read the homework submission policies. CSE547: Machine Learning for Big Data Spring 2019 Problem Set 1 Please read the homework submission policies. 1 Spark (25 pts) Write a Spark program that implements a simple People You Might Know social

More information

Project Assignment 2 (due April 6 th, 2015, 4:00pm, in class hard-copy please)

Project Assignment 2 (due April 6 th, 2015, 4:00pm, in class hard-copy please) Virginia Tech. Computer Science CS 4604 Introduction to DBMS Spring 2015, Prakash Project Assignment 2 (due April 6 th, 2015, 4:00pm, in class hard-copy please) Reminders: a. Out of 100 points. Contains

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu /2/8 Jure Leskovec, Stanford CS246: Mining Massive Datasets 2 Task: Given a large number (N in the millions or

More information

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016 15-319 / 15-619 Cloud Computing Recitation 3 Sep 13 & 15, 2016 1 Overview Administrative Issues Last Week s Reflection Project 1.1, OLI Unit 1, Quiz 1 This Week s Schedule Project1.2, OLI Unit 2, Module

More information

Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis

Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis Due by 11:59:59pm on Tuesday, March 16, 2010 This assignment is based on a similar assignment developed at the University of Washington. Running

More information

CS246: Mining Massive Data Sets Winter Final

CS246: Mining Massive Data Sets Winter Final CS246: Mining Massive Data Sets Winter 2013 Final These questions require thought, but do not require long answers. Be as concise as possible. You have three hours to complete this final. The exam has

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 1/8/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 Supermarket shelf

More information

Homework 7: Transactions, Logging and Recovery (due April 22nd, 2015, 4:00pm, in class hard-copy please)

Homework 7: Transactions, Logging and Recovery (due April 22nd, 2015, 4:00pm, in class hard-copy please) Virginia Tech. Computer Science CS 4604 Introduction to DBMS Spring 2015, Prakash Homework 7: Transactions, Logging and Recovery (due April 22nd, 2015, 4:00pm, in class hard-copy please) Reminders: a.

More information

MapReduce programming model

MapReduce programming model MapReduce programming model technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 9: Data Mining (3/4) March 8, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on

More information

Activity 03 AWS MapReduce

Activity 03 AWS MapReduce Implementation Activity: A03 (version: 1.0; date: 04/15/2013) 1 6 Activity 03 AWS MapReduce Purpose 1. To be able describe the MapReduce computational model 2. To be able to solve simple problems with

More information

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing Some Interesting Applications of Theory PageRank Minhashing Locality-Sensitive Hashing 1 PageRank The thing that makes Google work. Intuition: solve the recursive equation: a page is important if important

More information

Map/Reduce on the Enron dataset

Map/Reduce on the Enron dataset Map/Reduce on the Enron dataset We are going to use EMR on the Enron email dataset: http://aws.amazon.com/datasets/enron-email-data/ https://en.wikipedia.org/wiki/enron_scandal This dataset contains 1,227,255

More information

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions

More information

Finding Similar Sets

Finding Similar Sets Finding Similar Sets V. CHRISTOPHIDES vassilis.christophides@inria.fr https://who.rocq.inria.fr/vassilis.christophides/big/ Ecole CentraleSupélec Motivation Many Web-mining problems can be expressed as

More information

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (3/4) March 7, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

More information

Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1

Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1 Introduction: Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1 Clustering is an important machine learning task that tackles the problem of classifying data into distinct groups based

More information

HW 4: PageRank & MapReduce. 1 Warmup with PageRank and stationary distributions [10 points], collaboration

HW 4: PageRank & MapReduce. 1 Warmup with PageRank and stationary distributions [10 points], collaboration CMS/CS/EE 144 Assigned: 1/25/2018 HW 4: PageRank & MapReduce Guru: Joon/Cathy Due: 2/1/2018 at 10:30am We encourage you to discuss these problems with others, but you need to write up the actual solutions

More information

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)! Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm

More information

CS195H Homework 1 Grid homotopies and free groups. Due: February 5, 2015, before class

CS195H Homework 1 Grid homotopies and free groups. Due: February 5, 2015, before class CS195H Homework 1 Grid homotopies and free groups This second homework is almost all about grid homotopies and grid curves, but with a little math in the middle. This homework, last year, took people about

More information

High dim. data. Graph data. Infinite data. Machine learning. Apps. Locality sensitive hashing. Filtering data streams.

High dim. data. Graph data. Infinite data. Machine learning. Apps. Locality sensitive hashing. Filtering data streams. http://www.mmds.org High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Network Analysis

More information

CS 677 Distributed Operating Systems. Programming Assignment 3: Angry birds : Replication, Fault Tolerance and Cache Consistency

CS 677 Distributed Operating Systems. Programming Assignment 3: Angry birds : Replication, Fault Tolerance and Cache Consistency CS 677 Distributed Operating Systems Spring 2013 Programming Assignment 3: Angry birds : Replication, Fault Tolerance and Cache Consistency Due: Tue Apr 30 2013 You may work in groups of two for this lab

More information

SAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS

SAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS SAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS 1. What is SAP Vora? SAP Vora is an in-memory, distributed computing solution that helps organizations uncover actionable business insights

More information

Lesson 18: There is Only One Line Passing Through a Given Point with a Given

Lesson 18: There is Only One Line Passing Through a Given Point with a Given Lesson 18: There is Only One Line Passing Through a Given Point with a Given Student Outcomes Students graph equations in the form of using information about slope and intercept. Students know that if

More information

HOMEWORK 7. M. Neumann. Due: THU 8 MAR PM. Getting Started SUBMISSION INSTRUCTIONS

HOMEWORK 7. M. Neumann. Due: THU 8 MAR PM. Getting Started SUBMISSION INSTRUCTIONS CSE427S HOMEWORK 7 M. Neumann Due: THU 8 MAR 2018 1PM Getting Started Update your SVN repository. When needed, you will find additional materials for homework x in the folder hwx. So, for the current assignment

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 3/6/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 In many data mining

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/24/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 High dim. data

More information

ECS 189G: Intro to Computer Vision, Spring 2015 Problem Set 3

ECS 189G: Intro to Computer Vision, Spring 2015 Problem Set 3 ECS 189G: Intro to Computer Vision, Spring 2015 Problem Set 3 Instructor: Yong Jae Lee (yjlee@cs.ucdavis.edu) TA: Ahsan Abdullah (aabdullah@ucdavis.edu) TA: Vivek Dubey (vvkdubey@ucdavis.edu) Due: Wednesday,

More information

CMU - SCS / Database Applications Spring 2013, C. Faloutsos Homework 1: E.R. + Formal Q.L. Deadline: 1:30pm on Tuesday, 2/5/2013

CMU - SCS / Database Applications Spring 2013, C. Faloutsos Homework 1: E.R. + Formal Q.L. Deadline: 1:30pm on Tuesday, 2/5/2013 CMU - SCS 15-415/15-615 Database Applications Spring 2013, C. Faloutsos Homework 1: E.R. + Formal Q.L. Deadline: 1:30pm on Tuesday, 2/5/2013 Reminders - IMPORTANT: Like all homeworks, it has to be done

More information

TI2736-B Big Data Processing. Claudia Hauff

TI2736-B Big Data Processing. Claudia Hauff TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement

More information

MapReduce and Friends

MapReduce and Friends MapReduce and Friends Craig C. Douglas University of Wyoming with thanks to Mookwon Seo Why was it invented? MapReduce is a mergesort for large distributed memory computers. It was the basis for a web

More information

Homework 2 (by Ao Zeng) Solutions Due: Friday Sept 28, 11:59pm

Homework 2 (by Ao Zeng) Solutions Due: Friday Sept 28, 11:59pm CARNEGIE MELLON UNIVERSITY DEPARTMENT OF COMPUTER SCIENCE 15-445/645 DATABASE SYSTEMS (FALL 2018) PROF. ANDY PAVLO Homework 2 (by Ao Zeng) Solutions Due: Friday Sept 28, 2018 @ 11:59pm IMPORTANT: Upload

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

CS 345A Data Mining. MapReduce

CS 345A Data Mining. MapReduce CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Dis Commodity Clusters Web data sets can be ery large Tens to hundreds of terabytes

More information

10605 BigML Assignment 1(b): Naive Bayes with Hadoop API

10605 BigML Assignment 1(b): Naive Bayes with Hadoop API 10605 BigML Assignment 1(b): Naive Bayes with Hadoop API Due: Thursday, Sept. 15, 2016 23:59 EST via Autolab August 2, 2017 Policy on Collaboration among Students These policies are the same as were used

More information

Exam Datastrukturer. DIT960 / DIT961, VT-18 Göteborgs Universitet, CSE

Exam Datastrukturer. DIT960 / DIT961, VT-18 Göteborgs Universitet, CSE Exam Datastrukturer DIT960 / DIT961, VT-18 Göteborgs Universitet, CSE Day: 2018-10-12, Time: 8:30-12.30, Place: SB Course responsible Alex Gerdes, tel. 031-772 6154. Will visit at around 9:30 and 11:00.

More information

Finding Similar Items:Nearest Neighbor Search

Finding Similar Items:Nearest Neighbor Search Finding Similar Items:Nearest Neighbor Search Barna Saha February 23, 2016 Finding Similar Items A fundamental data mining task Finding Similar Items A fundamental data mining task May want to find whether

More information

Problem Set 0. General Instructions

CS246: Mining Massive Datasets Winter 2014 Problem Set 0 Due 9:30am January 14, 2014 General Instructions This homework is to be completed individually (no collaboration is allowed). Also, you are not

Data Partitioning Method for Mining Frequent Itemset Using MapReduce

1st International Conference on Applied Soft Computing Techniques 22 & 23.04.2017 In association with International Journal of Scientific Research in Science and Technology Data Partitioning Method for

Developing MapReduce Programs

Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

CSE141 Problem Set #4 Solutions

March 5, 2002 1 Simple Caches For this first problem, we have a 32 Byte cache with a line length of 8 bytes. This means that we have a total of 4 cache blocks (cache lines)

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

CS 385 Operating Systems Fall 2011 Homework Assignment 4 Simulation of a Memory Paging System

Due: Tuesday November 15th. Electronic copy due at 2:30, optional paper copy at the beginning of class. Overall

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu 2/25/2013 In many data mining

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

Assignment 1 is due tonight. Admin 1 late day to hand in Monday, 2 late days for Wednesday. Assignment 2 will be up soon. Start

Big Data Analytics CSCI 4030

High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommender systems Clustering Community Detection Queries on streams

Lab 3 Pig, Hive, and JAQL

Lab objectives In this lab you will practice what you have learned in this lesson, specifically you will practice with Pig, Hive, and Jaql languages. Lab instructions This lab

Mining Data Streams. Chapter The Stream Data Model

Chapter 4 Mining Data Streams Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make

CSCI6900 Assignment 1: Naïve Bayes on Hadoop

DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF GEORGIA CSCI6900 Assignment 1: Naïve Bayes on Hadoop DUE: Friday, January 29 by 11:59:59pm Out January 8, 2015 1 INTRODUCTION TO NAÏVE BAYES Much of machine

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

CSE 344 Introduction to Data Management. Section 9: AWS, Hadoop, Pig Latin Yuyin Sun

Homework 8 (Last hw) 0.5 TB (yes, TeraBytes!) of data 251 files of ~ 2GB each btc-2010-chunk-000 to btc-2010-chunk-317

Association Pattern Mining. Lijun Zhang

Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms

CS143 Introduction to Computer Vision Homework assignment 1.

Due: Problem 1 & 2 September 23 before Class Assignment 1 is worth 15% of your total grade. It is graded out of a total of 100 (plus 15 possible

Data-Intensive Computing with MapReduce

Session 6: Similar Item Detection Jimmy Lin University of Maryland Thursday, February 28, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Understanding the ViewPoint WFM Flat File

Introduction The purpose of this document is to provide a detailed description of what the WFM flat file is and how to configure ViewPoint to import it. Overview In order to provide complete technician

ASSIGNMENT 2 Conditionals, Loops, Utility Bills, and Calculators

COMP-202A, Fall 2009, All Sections Due: Friday, October 9, 2009 (23:55) You MUST do this assignment individually and, unless otherwise specified,

Sec 4.1 Coordinates and Scatter Plots. Coordinate Plane: Formed by two real number lines that intersect at a right angle.

Algebra I Chapter 4 Notes Name Sec 4.1 Coordinates and Scatter Plots Coordinate Plane: Formed by two real number lines that intersect at a right angle. X-axis: The horizontal axis Y-axis: The vertical

ML from Large Datasets

10-605 ML from Large Datasets Announcements HW1b is going out today You should now be on autolab have an account on stoat a locally-administered Hadoop cluster shortly receive a coupon for Amazon Web

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

Tekin Bicer David Chiu* and Gagan Agrawal Department of Computer Science and Engineering Ohio State University * School of Engineering

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU

MapReduce Overview Sometimes a single computer cannot process data or takes too long traditional serial programming is not always

Cloud Storage with AWS: EFS vs EBS vs S3 AHMAD KARAWASH

Cloud Storage with AWS Cloud storage is a critical component of cloud computing, holding the information used by applications. Big data analytics,

CS 283: Assignment 1 Geometric Modeling and Mesh Simplification

Ravi Ramamoorthi 1 Introduction This assignment is about triangle meshes as a tool for geometric modeling. As the complexity of models becomes

ECITE Cloud Platform User Manual. User Manual. AWS Platform. Powered By Dynamic Computing Cloud (DC2)

User Manual EarthCube Integration and Test Environment Hybrid Cloud Platform AWS Platform Powered By Dynamic Computing Cloud (DC2) Content Preface 1. Basic Operation 1.1 Login 1.2 User

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

Part A: MapReduce. Introduction Model Implementation issues

Part A: Massive Parallelism with MapReduce Introduction Model Implementation issues Acknowledgements Map-Reduce The material is largely based on material from the Stanford courses CS246, CS345A and

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

5:00pm-6:15pm, Monday, October 26th Name: ComputingID: This is a closed book and closed notes exam. No electronic

Introduction to Data Mining

Lecture #6: Mining Data Streams Seoul National University Outline Overview Sampling From Data Stream Queries Over Sliding Window Data Streams In many data mining situations,

We will be releasing HW1 today It is due in 2 weeks (1/25 at 23:59pm) The homework is long

1/21/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu Requires proving theorems

Weka (http://www.cs.waikato.ac.nz/ml/weka/)

The phases in which classifier's design can be divided are reflected in WEKA's Explorer structure: Data pre-processing (filtering) and representation Supervised

Homework 1 Due October 4, 2015

CS205, Fall 2015 Computing Foundations for Computational Science Ray Jones Homework 1 Due October 4, 2015 This homework will be an exploration of the uses of Spark. In this assignment, you will run jobs

Soundburst has been a music provider for Jazzercise since Our site is tailored just for Jazzercise instructors. We keep two years of full

Soundburst has been a music provider for Jazzercise since 2001. Our site is tailored just for Jazzercise instructors. We keep two years of full R-sets and at least four years of individual tracks on our

InPOsition App: Frequently Asked Questions

How do I download the mobile app? If you have an Android, you will go to Google Play. If you have an iPhone, you will go to the App Store. Then search, In Position

Introduction to Data Mining and Data Analytics

1/28/2016 MIST.7060 Data Analytics What Are Data Mining and Data Analytics? Data mining is the process of discovering hidden patterns in data, where Patterns

The body text of the page also has all newlines converted to spaces to ensure it stays on one line in this representation.

optionalattr="val2">(body) The body text of the page also has all newlines converted to spaces to ensure it stays on one line in this representation. MapReduce Steps: This presents the