Distributed Erlang and Map-Reduce

The goal of this lab is to make the naïve map-reduce implementation presented in the lecture a little less naïve. Specifically, we will make it run on multiple Erlang nodes, balance the load between them, and begin to make the code fault-tolerant.

Erlang resources

You will probably need to consult the Erlang documentation during this exercise. You can find the complete documentation here: http://www.erlang.org/doc/. To find documentation of a particular module, use the list of modules here: http://www.erlang.org/doc/man_index.html. Note that the Windows installer also installs the documentation locally, so if you are using Windows then you can just open the documentation via a link in the Start menu.

Connecting multiple Erlang nodes

The first step is to set up a network of connected Erlang nodes to play with. This can be done in two ways.

Running multiple Erlang nodes on one machine

Start several terminal windows (or Windows cmd windows), and in each one start a named Erlang shell. Do this using a command such as

    erl -sname foo

on Linux or the Mac, and

    werl -sname foo

on Windows. (The Windows version starts Erlang in its own window, with some useful menus.) The prompt displayed by the Erlang shell will show you what each Erlang node you created is called. For example, on my machine the prompt is

    (baz@johnstablet2012)1>

This tells me that the node I created is called baz@johnstablet2012 (an Erlang atom).

Running Erlang nodes on multiple machines

It's more fun using several machines. The procedure is the same as above, but first you must ensure that all machines use the same cookie. Edit the file .erlang.cookie in your home directory on each machine, and place the same Erlang atom in each one. Then start Erlang nodes as above; as long as the machines are on the same network, they should be able to find each other. In particular, machines in the labs at Chalmers ought to be able to find each other.

Connecting the nodes together

Your Erlang nodes are not yet connected: calling nodes() on any of them will return the empty list. To connect them, call

    net_adm:ping(NodeB).

on NodeA (where NodeA and NodeB are two of your node names). The result should be pong, and calling nodes() afterwards on either node should show you the other. Connect all your nodes in this way. Note that because Erlang builds a complete network, you need only connect each node to one other node yourself.
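For example, assuming hypothetical node names like foo@machine1 and bar@machine2 (use the names your own prompts report instead), you could connect everything from a single shell like this:

    %% Hypothetical node names; replace with your own.
    Nodes = ['foo@machine1', 'bar@machine2'],
    [net_adm:ping(N) || N <- Nodes],
    nodes().    %% should now list the other nodes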

Help! It doesn't work

On multiple machines, check that the cookie really is the same on all the nodes. Call erlang:get_cookie() on each node to make sure.

If NodeA can't connect to NodeB, try connecting NodeB to NodeA. Sometimes that helps!

Perhaps one or more of your machines requires a login before the network connection can be established. In a Windows network, try visiting the Shared Folder on each machine from the others; this may prompt for a password, and once you give the password then Erlang will also be able to connect.

Remote Procedure Calls

We'll start by making remote procedure calls to other nodes. We'll call io:format, which is Erlang's version of printf. Try

    rpc:call(OtherNode, io, format, ["hello"]).

You will find hello printed on your own node! Erlang redirects the output of processes spawned on other nodes back to the original spawning node, so io:format really did run on the other node, but its output was returned to the first one. To force output on the node where io:format runs, we also supply an explicit destination for the output. Try

    rpc:call(OtherNode, io, format, [user, "hello", []]).

(where the last argument is the list of values for escapes like ~p in the string; since "hello" contains no escapes, we pass the empty list). Make sure that the output really does appear on the correct node.

Compiling and loading

Loading code on other nodes is very simple. Write a simple module containing this function:

    -module(foo).
    -export([foo/0]).

    foo() -> io:format(user, "hello", []).

Now you can compile this module in the shell via c(foo). and then load it onto all your nodes via the command nl(foo). Try using rpc:call to call foo:foo on each node, checking that the output appears on the correct node.
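For instance, assuming foo has been loaded on every node with nl(foo), one way to try it on all the other nodes at once from the shell is:

    [rpc:call(N, foo, foo, []) || N <- nodes()].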

Naïve Map-Reduce

Below you will find the source code of three of the modules presented in the lecture on map-reduce: a very simple map-reduce implementation on one node (both sequential and parallel), and two clients: a web crawler and a page rank calculator. Compile these modules, and ensure that you can crawl a part of the web:

    crawl:crawl("http://www.cse.chalmers.se/", 3).

You will need to start Erlang's http client first, using inets:start().

The page rank calculator uses the information collected by the web crawler, but it assumes that the output of the web crawler has been saved in a dets file, a file that contains a set of key-value pairs. You will need to use dets to do this lab. You can find the documentation here (http://www.erlang.org/doc/man/dets.html), and there is quite a lot of it, but you will only need a few functions from this module:

    dets:open_file (see the code below for usage)
    dets:insert, which inserts a list of key-value pairs into the file
    dets:lookup, which returns a list of all the key-value pairs with a given key

Save the results from the web crawler in a dets file called web.dat (a sketch of one way to do this appears after the code listing at the end of this document), and check that the page ranking algorithm works. Then copy web.dat onto all your nodes; this will enable you to distribute the page rank computation across your network. You should collect 40-100MB of web data so that the page-ranking algorithm takes appreciable time to run.

Distributing Map-Reduce

Modify the parallel map-reduce implementation so that it spawns worker processes on all of your nodes. Measure the performance of the page-ranking algorithm with the original parallel version, and with your new distributed version.

Load-balancing Map-Reduce

Of course, it is not really sensible to spawn all the worker processes at the same time. Instead, we should start enough workers to keep all the nodes busy, then send each node new work as it completes its previous job. Write a worker pool function which, given a list of 0-ary functions, returns a list of their results, distributing the work across the connected nodes in this way. That is, semantically

    worker_pool(Funs) -> [Fun() || Fun <- Funs],

but the implementation should make use of all the nodes in your network. A good approach (sketched below) is to start several worker processes on each node, each of which keeps requesting a new function to call, then calling it and returning its result to the master, until no more work remains to be done.

Modify the map-reduce implementation again to make use of your worker pool in both the map and the reduce phases. Measure the performance of page ranking with your new distributed map-reduce: is it faster?

Fault-tolerant Map-Reduce

Enhance your worker pool to monitor the state of the worker processes, so that if a worker should die, its work is reassigned to a new worker. Test your fault tolerance by killing one of your Erlang nodes (not the master) while the page-ranking algorithm is running. It should complete, with the same results, despite the failure.
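As a starting point for the load-balancing exercise, here is a minimal sketch of the master/worker shape described above. It is only a skeleton under stated assumptions, not a complete solution: the module and function names (pool_sketch, worker_pool, master, worker), the number of workers per node, and the message formats are all illustrative choices, and fault tolerance is deliberately left out. The module must be loaded on all nodes (for example with nl(pool_sketch)) before it can spawn workers remotely.

    -module(pool_sketch).
    -export([worker_pool/1]).

    %% Sketch only: run a list of 0-ary functions on all connected nodes.
    %% Results are tagged with their position in Funs so that the result
    %% list comes back in the original order.
    worker_pool(Funs) ->
        Master  = self(),
        Work    = lists:zip(lists:seq(1, length(Funs)), Funs),
        Nodes   = [node() | nodes()],
        Workers = [spawn_link(Node, fun() -> worker(Master) end)
                   || Node <- Nodes, _ <- lists:seq(1, 4)],   %% 4 workers per node (arbitrary)
        Results = master(Work, length(Funs), []),
        [W ! stop || W <- Workers],
        flush_ready(),
        [R || {_I, R} <- lists:keysort(1, Results)].

    %% The master hands work to any worker that asks for it, and collects
    %% results until it has one per function.
    master(_Work, Total, Results) when length(Results) =:= Total ->
        Results;
    master([{I, F} | Work], Total, Results) ->
        receive
            {ready, Pid}   -> Pid ! {work, I, F},
                              master(Work, Total, Results);
            {result, J, R} -> master([{I, F} | Work], Total, [{J, R} | Results])
        end;
    master([], Total, Results) ->
        receive
            {ready, _Pid}  -> master([], Total, Results);   %% no work left; worker is stopped later
            {result, J, R} -> master([], Total, [{J, R} | Results])
        end.

    %% Each worker repeatedly asks the master for a function, runs it, and
    %% sends back the tagged result, until it is told to stop.
    worker(Master) ->
        Master ! {ready, self()},
        receive
            {work, I, F} -> Master ! {result, I, F()},
                            worker(Master);
            stop         -> ok
        end.

    %% Remove any stray {ready,_} messages left in the caller's mailbox.
    flush_ready() ->
        receive {ready, _} -> flush_ready()
        after 0 -> ok
        end.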

Hand-ins

You should submit the code of the three versions of map-reduce described above, together with your performance measurements. Describe your set-up: were you running on one machine or several, and how much web data were you searching? What conclusions would you draw from this exercise? The deadline is midnight on Tuesday, 21st May.

More

A full map-reduce implementation does a lot more than this, of course. The next step would be to avoid sending all the data via the master (the results of each mapper should be sent directly to the right reducer), although this introduces a lot more complexity. Something to experiment with later, perhaps?

The Code

This is a very simple implementation of map-reduce, in both sequential and parallel versions.

    -module(map_reduce).
    -compile(export_all).

We begin with a simple sequential implementation, just to define the semantics of map-reduce. The input is a collection of key-value pairs. The map function maps each key-value pair to a list of key-value pairs. The reduce function is then applied to each key and list of corresponding values, and generates in turn a list of key-value pairs. These are the result.

    map_reduce_seq(Map, Reduce, Input) ->
        Mapped = [{K2,V2}
                  || {K,V} <- Input,
                     {K2,V2} <- Map(K,V)],
        reduce_seq(Reduce, Mapped).

    reduce_seq(Reduce, KVs) ->
        [KV || {K,Vs} <- group(lists:sort(KVs)),
               KV <- Reduce(K,Vs)].

    group([]) ->
        [];
    group([{K,V}|Rest]) ->
        group(K, [V], Rest).

    group(K, Vs, [{K,V}|Rest]) ->
        group(K, [V|Vs], Rest);
    group(K, Vs, Rest) ->
        [{K,lists:reverse(Vs)} | group(Rest)].

    map_reduce_par(Map, M, Reduce, R, Input) ->
        Parent = self(),
        Splits = split_into(M, Input),
        Mappers =
            [spawn_mapper(Parent, Map, R, Split)
             || Split <- Splits],
        Mappeds =
            [receive {Pid,L} -> L end || Pid <- Mappers],
        Reducers =
            [spawn_reducer(Parent, Reduce, I, Mappeds)
             || I <- lists:seq(0, R-1)],
        Reduceds =
            [receive {Pid,L} -> L end || Pid <- Reducers],
        lists:sort(lists:flatten(Reduceds)).

    spawn_mapper(Parent, Map, R, Split) ->
        spawn_link(fun() ->
                       Mapped = [{erlang:phash2(K2,R), {K2,V2}}
                                 || {K,V} <- Split,
                                    {K2,V2} <- Map(K,V)],
                       Parent ! {self(), group(lists:sort(Mapped))}
                   end).

    split_into(N, L) ->
        split_into(N, L, length(L)).

    split_into(1, L, _) ->
        [L];
    split_into(N, L, Len) ->
        {Pre, Suf} = lists:split(Len div N, L),
        [Pre | split_into(N-1, Suf, Len - (Len div N))].

    spawn_reducer(Parent, Reduce, I, Mappeds) ->
        Inputs = [KV
                  || Mapped <- Mappeds,
                     {J,KVs} <- Mapped,
                     I == J,
                     KV <- KVs],
        spawn_link(fun() -> Parent ! {self(), reduce_seq(Reduce, Inputs)} end).

This implements a page rank algorithm using map-reduce.

    -module(page_rank).
    -compile(export_all).

Use map_reduce to count word occurrences.

    map(Url, ok) ->
        [{Url,Body}] = dets:lookup(web, Url),
        Urls = crawl:find_urls(Url, Body),
        [{U,1} || U <- Urls].

    reduce(Url, Ns) ->
        [{Url, lists:sum(Ns)}].

    %% 188 seconds
    page_rank() ->
        dets:open_file(web, [{file,"web.dat"}]),
        Urls = dets:foldl(fun({K,_},Keys) -> [K|Keys] end, [], web),
        map_reduce:map_reduce_seq(fun map/2, fun reduce/2,
                                  [{Url,ok} || Url <- Urls]).

    %% 86 seconds
    page_rank_par() ->
        dets:open_file(web, [{file,"web.dat"}]),
        Urls = dets:foldl(fun({K,_},Keys) -> [K|Keys] end, [], web),
        map_reduce:map_reduce_par(fun map/2, 32, fun reduce/2, 32,
                                  [{Url,ok} || Url <- Urls]).

This module defines a simple web crawler using map-reduce.

    -module(crawl).
    -compile(export_all).

Crawl from a URL, following links to depth D. Before calling this function, the inets service must be started using inets:start().

    crawl(Url, D) ->
        Pages = follow(D, [{Url,undefined}]),
        [{U,Body} || {U,Body} <- Pages,
                     Body /= undefined].

    follow(0, KVs) ->
        KVs;
    follow(D, KVs) ->
        follow(D-1,
               map_reduce:map_reduce_par(fun map/2, 20, fun reduce/2, 1, KVs)).

    map(Url, undefined) ->
        Body = fetch_url(Url),
        [{Url,Body}] ++
            [{U,undefined} || U <- find_urls(Url, Body)];
    map(Url, Body) ->
        [{Url,Body}].

    reduce(Url, Bodies) ->
        case [B || B <- Bodies, B /= undefined] of
            [] ->
                [{Url,undefined}];
            [Body] ->
                [{Url,Body}]
        end.

    fetch_url(Url) ->
        case httpc:request(Url) of
            {ok, {_, _Headers, Body}} ->
                Body;
            _ ->
                ""
        end.

Find all the urls in an Html page with a given Url.

    find_urls(Url, Html) ->
        Lower = string:to_lower(Html),
        %% Find all the complete URLs that occur anywhere in the page
        Absolute = case re:run(Lower, "http://.*?(?=\")", [global]) of
                       {match, Locs} ->
                           [lists:sublist(Html, Pos+1, Len)
                            || [{Pos,Len}] <- Locs];
                       _ ->
                           []
                   end,
        %% Find links to files in the same directory, which need to be
        %% turned into complete URLs.
        Relative = case re:run(Lower, "href *= *\"(?!http:).*?(?=\")", [global]) of
                       {match, RLocs} ->
                           [lists:sublist(Html, Pos+1, Len)
                            || [{Pos,Len}] <- RLocs];
                       _ ->
                           []
                   end,
        Absolute ++ [Url ++ "/" ++
                         lists:dropwhile(
                           fun(Char) -> Char == $/ end,
                           tl(lists:dropwhile(fun(Char) -> Char /= $" end,
                                              R)))
                     || R <- Relative].
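For reference, here is one possible shell sequence for collecting crawl results and saving them into web.dat, as needed for the page rank part of the lab. It is only a sketch: the seed URL and depth are examples, and you may want to crawl several seeds to reach 40-100MB of data.

    inets:start(),
    Pages = crawl:crawl("http://www.cse.chalmers.se/", 3),
    dets:open_file(web, [{file, "web.dat"}]),
    dets:insert(web, Pages),       %% Pages is a list of {Url,Body} pairs
    dets:close(web).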