optionalattr="val2">(body)</text></revision>

The body text of the page also has all newlines converted to spaces to ensure it stays on one line in this representation.

MapReduce Steps:

This presents the high-level requirements of what each phase of the program should do. (While there may be other, equivalent implementations of PageRank, this suggests one such method that can be implemented in this lab.)

Step 1: Create Link Graph

Process individual lines of the input corpus, one at a time. These lines contain the XML representations of individual pages of Wikipedia. Turn this into a link graph and initial PageRank values (use 1-d, where d is the damping factor, as your initial PageRank value). Think: What output key do you need to use? What data needs to be in the output value?

Step 2: Process PageRank

This is the component which will be run in your main loop. The output of this step should be directly readable as the input of this same step, so that you can run it multiple times. In this step, you should divide fragments of the input PageRank up among the links on a page, and then recombine all the fragments received by a page into the next iteration of PageRank; one possible mapper/reducer pair is sketched below.
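This is only a sketch of one way Step 2 might look, not the required implementation. It assumes the pass reads and writes (Text, Text) pairs (e.g., via KeyValueTextInputFormat or a SequenceFile of Text pairs), a hypothetical "rank|link1,link2,..." value layout, and a damping factor of 0.85; none of these are prescribed by the assignment.

    // PageRankIteration: one iteration of the PageRank loop (Step 2).
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class PageRankIteration {
        private static final double D = 0.85; // damping factor (assumed value)

        public static class Map extends MapReduceBase
                implements Mapper<Text, Text, Text, Text> {
            public void map(Text page, Text value,
                            OutputCollector<Text, Text> out, Reporter reporter)
                    throws IOException {
                // Assumed layout: "rank|link1,link2,..."
                String[] parts = value.toString().split("\\|", 2);
                double rank = Double.parseDouble(parts[0]);
                String linkList = parts.length > 1 ? parts[1] : "";
                // Re-emit the link structure so the reducer can preserve the graph.
                out.collect(page, new Text("links|" + linkList));
                // Divide this page's rank evenly among its outgoing links.
                if (!linkList.isEmpty()) {
                    String[] links = linkList.split(",");
                    for (String link : links) {
                        out.collect(new Text(link),
                                    new Text("frag|" + (rank / links.length)));
                    }
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, Text, Text, Text> {
            public void reduce(Text page, Iterator<Text> values,
                               OutputCollector<Text, Text> out, Reporter reporter)
                    throws IOException {
                double sum = 0.0;
                String links = "";
                while (values.hasNext()) {
                    String v = values.next().toString();
                    if (v.startsWith("links|")) {
                        links = v.substring("links|".length());
                    } else if (v.startsWith("frag|")) {
                        sum += Double.parseDouble(v.substring("frag|".length()));
                    }
                }
                // Standard PageRank update; the output line is readable as this
                // same step's input, so the pass can be looped.
                out.collect(page, new Text(((1 - D) + D * sum) + "|" + links));
            }
        }
    }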
Step 3: Cleanup and Sorting

The goal of this lab is to understand which pages on Wikipedia have a high PageRank value. Therefore, we use one more "cleanup" pass to extract this data into a form we can inspect. Strip away any data that you were using during the repetitions of Step 2, so that the output is just a mapping between page names and PageRank values. We would like our output data sorted by PageRank value. Hint: Use only 1 reducer to sort everything. What should your key and value be to sort things by PageRank? (One common answer is sketched below.) At this point, the data can be inspected and the most highly-ranked pages can be determined.
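Here is a minimal sketch of one answer to that hint, continuing with the assumed "rank|links" layout from the Step 2 sketch above: emit the rank as the key and the page name as the value, so the framework's sort phase does the ordering. Negating the rank makes the single reducer see the highest-ranked pages first.

    // PageRankCleanup: strip link data and sort pages by rank (Step 3).
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class PageRankCleanup {
        public static class Map extends MapReduceBase
                implements Mapper<Text, Text, DoubleWritable, Text> {
            public void map(Text page, Text value,
                            OutputCollector<DoubleWritable, Text> out,
                            Reporter reporter) throws IOException {
                try {
                    double rank =
                        Double.parseDouble(value.toString().split("\\|", 2)[0]);
                    // Negate so that ascending key order == descending rank.
                    out.collect(new DoubleWritable(-rank), page);
                } catch (NumberFormatException e) {
                    // Real data: silently skip malformed lines, don't crash.
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<DoubleWritable, Text, Text, DoubleWritable> {
            public void reduce(DoubleWritable negRank, Iterator<Text> pages,
                               OutputCollector<Text, DoubleWritable> out,
                               Reporter reporter) throws IOException {
                // Run with exactly 1 reduce task; undo the negation on output.
                while (pages.hasNext()) {
                    out.collect(new Text(pages.next()),
                                new DoubleWritable(-negRank.get()));
                }
            }
        }
    }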
CodeLab Exercise:

Implement the PageRank algorithm described above. You will need a driver class to run this process, which should run the link graph generator, calculate PageRank for 10 iterations, and then run the cleanup pass. (A sketch of such a driver appears at the end of the advice below.) Run PageRank, and find out what the top ten highest-PageRank pages are.

Overall advice:

- Our local copy of Wikipedia is stored in "/shared/wikipedia" on the DFS. Use this as your first input directory. Do NOT use it as your output directory. And do not use it at all until you are sure you've got everything working! There is a test data set stored in "/shared/wiki_small" on the DFS; this contains about 100,000 articles. Use the data from this source as your first input directory to test your system. Move up to the full copy when you are convinced that your system works.
- SequenceFiles are a special binary format used by Hadoop for fast intermediate I/O. The output of the link graph generator, the input and output of the PageRank cycle, and the input to the cleanup pass can all be set to org.apache.hadoop.mapred.SequenceFileInputFormat and SequenceFileOutputFormat for faster processing, if you don't need the intermediate values for debugging.
- Test a single pass of the PageRank mapper/reducer before putting it in a loop.
- Each pass will require its own input and output directory; one output directory is used as the input directory for the next pass of the algorithm. Set the input and output directory names in the JobConf to values that make sense for this flow.
- Create a new JobClient and JobConf object for each MapReduce pass. main() should call a series of driver methods. Remember that you need to remove your intermediate/output directories between executions of your program.
- The input and output types for each of these passes should be Text. You should design a textual representation for the data that must be passed through each phase, one that you can serialize to and parse from efficiently. Alternatively, if you're feeling daring, implement your own subclass of Writable which includes the information you need.
- Set the number of map tasks to 300 (this is based on our cluster size). Set the number of reduce tasks to 50.
- The PageRank for each page will be a very small floating-point number. You may want to multiply all PageRank values by a constant of 10,000 or so in the cleanup step to make these numbers more readable.
- The final cleanup step should have 1 reduce task, to get a single list of all pages.
- Don't forget, this is "real" data. We've done most of the dirty work for you in terms of formatting the input in a presentable manner, but there might be lines which don't conform to the layout you expect, blank lines, etc. Your code must be robust to these parsing errors. Just ignore any lines that are illegal -- but don't cause a crash!
- Start early. This project represents considerably more programming effort than project 1. Remember, project 1 provides instructions on how to set up a project and execute a program on the Hadoop cluster. Refer back to the instructions in that document for how to do this assignment as well.
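Pulling the exercise and the advice above together, a driver might be organized as below. This is a sketch under assumptions: the class name, helper methods, and directory paths ("pr/iter0" and so on) are all illustrative, and you would wire in your own mapper, reducer, and format classes where the comments indicate.

    // PageRankDriver: run the link graph pass, 10 PageRank passes, and cleanup.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.*;

    public class PageRankDriver {
        static final int ITERATIONS = 10;

        public static void main(String[] args) throws Exception {
            JobClient.runJob(linkGraphJob("/shared/wiki_small", "pr/iter0"));
            for (int i = 0; i < ITERATIONS; i++) {
                // Each pass reads the previous pass's output directory.
                JobClient.runJob(pageRankJob("pr/iter" + i, "pr/iter" + (i + 1)));
            }
            JobClient.runJob(cleanupJob("pr/iter" + ITERATIONS, "pr/final"));
        }

        // A fresh JobConf for every pass, as advised above.
        static JobConf baseJob(String in, String out) {
            JobConf conf = new JobConf(PageRankDriver.class);
            FileInputFormat.setInputPaths(conf, new Path(in));
            FileOutputFormat.setOutputPath(conf, new Path(out));
            conf.setNumMapTasks(300);   // per the cluster-size advice above
            conf.setNumReduceTasks(50);
            return conf;
        }

        static JobConf linkGraphJob(String in, String out) {
            JobConf conf = baseJob(in, out);
            // conf.setMapperClass(...); conf.setReducerClass(...); formats, etc.
            return conf;
        }

        static JobConf pageRankJob(String in, String out) {
            JobConf conf = baseJob(in, out);
            // e.g. conf.setMapperClass(PageRankIteration.Map.class); ...
            return conf;
        }

        static JobConf cleanupJob(String in, String out) {
            JobConf conf = baseJob(in, out);
            conf.setNumReduceTasks(1); // single sorted list of all pages
            return conf;
        }
    }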
Testing Your Code:

If you try to test your program on the cluster, you're going to waste inordinate amounts of time shuttling code back and forth, as well as potentially waiting on other students who run long jobs. Before you run any MapReduce code, you should unit test individual functions, calling them on a single piece of example input data (e.g., a single line out of the data set) and seeing what they do. After you have unit tested all your components, you should do some small integration tests which make sure all the working parts fit together, and work on a small amount of representative data.
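As an illustration of that kind of unit test, here is a minimal JUnit 4 sketch for the hypothetical Step 2 mapper above. The fake collector is hand-rolled so the test runs without a cluster; the class and test names are, again, just placeholders.

    // PageRankMapTest: unit test the Step 2 mapper on one example record.
    import static org.junit.Assert.*;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.junit.Test;

    public class PageRankMapTest {
        // Records every emitted key/value pair so the test can inspect them.
        static class FakeCollector implements OutputCollector<Text, Text> {
            List<String> emitted = new ArrayList<String>();
            public void collect(Text key, Text value) throws IOException {
                emitted.add(key + " -> " + value);
            }
        }

        @Test
        public void splitsRankEvenlyAmongLinks() throws IOException {
            FakeCollector out = new FakeCollector();
            new PageRankIteration.Map().map(
                    new Text("PageA"), new Text("0.5|PageB,PageC"), out, null);
            // One record re-emitting the link list, plus one fragment per link.
            assertEquals(3, out.emitted.size());
            assertTrue(out.emitted.contains("PageA -> links|PageB,PageC"));
            assertTrue(out.emitted.contains("PageB -> frag|0.25"));
            assertTrue(out.emitted.contains("PageC -> frag|0.25"));
        }
    }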
For this purpose, we have posted a *very* small data set on the web site at: http://www.cs.washington.edu/education/courses/cse490h/08au/projects/wikimicro.txt.gz

This is a ~2 MB download which will unzip to about 5 MB of text in the exact format you should expect to see on the cluster. Download the contents of this file and stick it in an "input" folder on your local machine. Test against this for your unit testing and initial debugging. After this works, then move up to /shared/wiki_small. After it works there, then and only then should you run on the full dataset.

If individual passes of your MapReduce program take more than 15 minutes, you have done something wrong. You should kill your job (see instructions in assignment 1), figure out where your bugs are, and try again. (Don't forget to pre-test on smaller datasets again!)

Extensions: (For the Fearless)

If you're finished early and want a challenge, feel free to try some/all of these. Karma++ if you do.

- Write a pass that determines whether or not PageRank has converged, rather than using a fixed number of iterations.
- Get an inverted indexer working over the text of this XML document.
- Combine the inverted indexer and PageRank to make a simple search engine.
- Write a MapReduce pass to find the top ten pages without needing to push everything through a single reducer.

(I haven't done these. No bets on how easy/hard these are.)

Writeup:

Please write up in a text file (no Word documents, PDFs, etc. -- plain old text, please!) the answers to the following questions:

1) How much time did this lab take you? What was the most challenging component?
2) Describe the data types you used for keys and values in the various MapReduce stages. If you serialized data to Text and back, describe how you laid the contents of the Text out in the data stream.
3) What scalability hazards, if any, do you see in this system? Do you have any ideas how these might be overcome or mitigated?
4) What are the 10 pages with the highest PageRank? Does that surprise you? Why/why not?
5) Describe how you tested your code before running on the large data set. Did you find any tricky or surprising bugs? Include any test programs / JUnit test cases, etc., in your code submission.
6) List any extensions you performed, assumptions you made about the nature of the problem, etc.