CmpE 138 Spring 2011 Special Topics. Shivanshu Singh, shivanshu.sjsu@gmail.com
Map Reduce and the Election process
Map Reduce: Typical single-node architecture [diagram: Application, CPU, Memory, Storage]
Map Reduce: Counting; Sorting (merge sort, quick sort); BIG Data; Data Mining; Trend Analysis (e.g. Twitter); Recommendation Systems (if bought = (A, B) => likely to buy C); Google Search
The Underlying Technologies
Distributed systems, storage, computing. Web data sets can be very large: tens to hundreds of terabytes, soon petabyte(s). Cannot mine on a single server (why?). Standard architecture emerging: cluster of commodity Linux nodes with a (very) high-speed Ethernet interconnect. How to organize computations on this architecture? Storage is cheap but data management is not (nodes are bound to fail). Mask issues such as hardware failure.
Goal: Stable storage for (stable) computation. In other words, if any of the nodes fails, how do we ensure data availability and persistence?
Goal: Stable Storage. Answer: distribute it and have redundancy. Filesystem! Manage this: data operations and services. Store and retrieve on a single logical resource that is distributed over a number of locations.
DFS: Distributed File System. Provides a global file namespace. Google GFS; Hadoop HDFS; etc. Typical usage pattern: huge files (100s of GB to TB); reads and appends are common.
DFS Chunk Servers: File is split into contiguous chunks; typically each chunk is 16-64 MB. Each chunk is replicated (usually 2x or 3x); try to keep replicas in different racks. Master node (GFS), a.k.a. Name Node in HDFS: stores metadata; might be replicated. Client library for file access: talks to the master to find chunk servers, then connects directly to chunk servers to access data.
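The chunk-splitting and replica-placement policy above can be sketched as follows. This is a toy illustration, not actual GFS/HDFS code; the 64 MB chunk size is one of the typical values from the slide, and the round-robin rack choice is an invented stand-in for the real placement heuristics.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch (not GFS/HDFS code): split a file into fixed-size chunks
// and pick 3 replica racks per chunk, keeping replicas in different
// racks as the slide describes.
public class ChunkPlacement {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB

    // Number of chunks a file of the given size is split into.
    static long numChunks(long fileSize) {
        return (fileSize + CHUNK_SIZE - 1) / CHUNK_SIZE;
    }

    // Pick 3 distinct racks for a chunk, round-robin over the cluster.
    static List<Integer> replicaRacks(long chunkIndex, int numRacks) {
        List<Integer> racks = new ArrayList<>();
        for (int r = 0; r < 3; r++) {
            racks.add((int) ((chunkIndex + r) % numRacks));
        }
        return racks;
    }

    public static void main(String[] args) {
        // A 200 MB file becomes four 64 MB chunks (the last one partial).
        System.out.println(numChunks(200L * 1024 * 1024)); // 4
        System.out.println(replicaRacks(0, 5)); // [0, 1, 2]
    }
}
```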
Chubby: A coarse-grained lock service; distributed systems can use it to synchronize access to shared resources. Intended for use by loosely-coupled distributed systems. In GFS: elect a master. In BigTable: master election, client discovery, table service locking.
Interface: Presents a simple distributed file system. Clients can open/close/read/write files. Reads and writes are whole-file. Also supports advisory reader/writer locks. Clients can register for notification of file updates.
Topology [diagram: one Chubby cell with a master and four replicas; ALL client traffic goes to the master]
Master Election: All replicas try to acquire a write lock on a designated file. The one that gets the lock is the master. The master can then write its address to the file; other replicas can read this file to discover the chosen master's name. Chubby doubles as a name service.
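The election scheme above can be simulated with a tiny in-memory stand-in for the lock service. The LockFile class and its API are invented for illustration and are not the real Chubby interface; the point is that an atomic "acquire write lock" pick exactly one master, whose address then doubles as a name-service entry.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Toy in-memory stand-in for a Chubby-style lock file (invented API,
// not real Chubby). All replicas race to acquire the write lock on a
// designated file; the winner writes its address there, so other
// replicas can read the file to discover the master.
public class MasterElection {
    static class LockFile {
        final AtomicReference<String> holder = new AtomicReference<>();
        String contents = "";

        // Returns true iff the caller acquired the write lock.
        boolean tryWriteLock(String who) {
            return holder.compareAndSet(null, who);
        }
    }

    // Each replica tries the lock; exactly one becomes master.
    static String elect(LockFile file, List<String> replicaAddrs) {
        for (String addr : replicaAddrs) {
            if (file.tryWriteLock(addr)) {
                file.contents = addr; // master publishes its address
            }
        }
        return file.contents;         // what any replica would read back
    }

    public static void main(String[] args) {
        LockFile f = new LockFile();
        // First replica to reach the lock wins.
        System.out.println(elect(f, List.of("r1:9000", "r2:9000", "r3:9000")));
    }
}
```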
Consensus: A Chubby cell is usually 5 replicas; 3 must be alive for the cell to be viable. How do replicas in Chubby agree on their own master and on official lock values? The PAXOS algorithm.
PAXOS Paxos is a family of algorithms (by Leslie Lamport) designed to provide distributed consensus in a network of several processors.
Processor Assumptions: Operate at arbitrary speed. Independent, random failures. Processors with stable storage may rejoin the protocol after failure. Do not lie, collude, or attempt to maliciously subvert the protocol.
Network Assumptions: All processors can communicate with (see) one another. Messages are sent asynchronously and may take arbitrarily long to deliver. The order of messages is not guaranteed: they may be lost, reordered, or duplicated. Messages, if delivered, are not corrupted in the process.
A Fault-Tolerant Memory of Facts: Paxos provides a memory for individual facts in the network. A fact is a binding from a variable to a value. Paxos between 2F+1 processors is reliable and can make progress if up to F of them fail.
Roles: Proposer, an agent that proposes a fact. Leader, the authoritative proposer. Acceptor, holds agreed-upon facts in its memory. Learner, may retrieve a fact from the system.
Safety Guarantees: Nontriviality: only proposed values can be learned. Consistency: at most one value can be learned. Liveness: if at least one value V has been proposed, eventually any learner L will get some value.
Key Idea: Acceptors do not act unilaterally. For a fact to be learned, a quorum of acceptors must agree upon the fact. A quorum is any majority of acceptors. Given acceptors {A, B, C, D}, Q = {{A, B, C}, {A, B, D}, {B, C, D}, {A, C, D}}.
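The quorum rule above reduces to a simple majority test; because any two majorities of the same acceptor set overlap in at least one member, two conflicting facts cannot both gather quorums. A minimal sketch:

```java
import java.util.Set;

// Sketch of the quorum rule from the slide: a quorum is any strict
// majority of the acceptor set.
public class Quorum {
    static boolean isQuorum(Set<String> votes, int totalAcceptors) {
        return votes.size() > totalAcceptors / 2;
    }

    public static void main(String[] args) {
        // With acceptors {A, B, C, D}, any 3 form a quorum; 2 do not.
        System.out.println(isQuorum(Set.of("A", "B", "C"), 4)); // true
        System.out.println(isQuorum(Set.of("A", "B"), 4));      // false
    }
}
```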
Basic Paxos: Determines the authoritative value for a single variable. Several proposers offer a value V_n to set the variable to. The system converges on a single agreed-upon V to be the fact.
Step 1: Prepare. A proposer sends a PREPARE message with a proposal number to the acceptors. Credit: Spinnaker Labs Inc.
Step 2: Promise. PROMISE x: the acceptor will accept proposals numbered x or higher only. Proposer 1 is ineligible because an acceptor quorum has voted for a higher number than j.
Step 3: Accept. The proposer asks the quorum to accept its value under the promised proposal number.
Step 4: Accepted. The acceptors acknowledge (ack) the accepted proposal.
Learning If a learner interrogates the system, a quorum will respond with fact V_k
Basic Paxos, continued: Proposer 1 is free to try again with a proposal number > k; it can take over leadership and write in a new authoritative value. The official fact changes atomically on all acceptors from the perspective of learners. If a leader dies mid-negotiation, the value just drops, and another leader tries with a higher proposal.
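The prepare/promise/accept steps above can be condensed into a minimal single-decree acceptor. This is a heavily simplified classroom sketch, not a complete Paxos implementation: it omits learners, message loss, and the rule that a new proposer must adopt the highest previously accepted value.

```java
import java.util.List;

// Minimal single-decree acceptor sketch, following the slides: an
// acceptor promises to ignore proposals numbered below its promise,
// and a value is chosen once a quorum accepts the same proposal.
public class PaxosAcceptor {
    long promised = -1;          // highest proposal number promised
    long acceptedN = -1;         // number of the accepted proposal
    String acceptedV = null;     // value of the accepted proposal

    // Phase 1: PREPARE(n) -> PROMISE only if n beats earlier promises.
    boolean prepare(long n) {
        if (n > promised) { promised = n; return true; }
        return false;
    }

    // Phase 2: ACCEPT(n, v) succeeds only if no higher promise was made.
    boolean accept(long n, String v) {
        if (n >= promised) {
            promised = n; acceptedN = n; acceptedV = v;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<PaxosAcceptor> cell = List.of(
            new PaxosAcceptor(), new PaxosAcceptor(), new PaxosAcceptor());
        // A proposer with number 5 collects promises from all acceptors,
        // then asks them to accept its value.
        long promises = cell.stream().filter(a -> a.prepare(5)).count();
        long accepts = cell.stream().filter(a -> a.accept(5, "r2 is master")).count();
        System.out.println(promises + " " + accepts); // quorum reached: 3 3
        // A stale proposer with a lower number (3) is now rejected.
        System.out.println(cell.get(0).prepare(3)); // false
    }
}
```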
Paxos in Chubby: Replicas in a cell initially use Paxos to establish the leader; a majority of replicas must agree. Replicas promise not to try to elect a new master for at least a few seconds (master lease). The master lease is periodically renewed. Read more: http://labs.google.com/papers/chubby.html http://labs.google.com/papers/bigtable-osdi06.pdf
Big Table. Google's needs: Data reliability. High-speed retrieval. Storage of huge numbers of records (several TB of data). (Multiple) past versions of records should be available.
HBase - Big Table Features: Simplified data retrieval mechanism: (row, col, timestamp) => value lookup, only. No relational operators. Arbitrary number of columns per row. Arbitrary data type for each column. New constraint: data validation must be performed by the application layer!
Logical Data Representation: Rows & columns are identified by arbitrary strings. Multiple versions of a (row, col) cell can be accessed through timestamps. The application controls the version-tracking policy. Columns are grouped into column families.
Data Model: Related columns are stored in a fixed number of families. The family name is a prefix on the column name, e.g., fileattr:owning_group, fileattr:owning_user. A column name has the form "<family>:<label>" where <family> and <label> can be arbitrary byte arrays. Lookup is hash-based. Column families are stored physically close on disk; items in a given column family should have roughly the same read/write characteristics and contain similar data.
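A small helper sketch for the "<family>:<label>" naming convention above, splitting a column name at the first colon into its family prefix and label (an illustration; string-typed for simplicity, even though the slide notes the parts can be arbitrary byte arrays):

```java
// Sketch: split a BigTable/HBase-style column name of the form
// "<family>:<label>" into its family prefix and label.
public class ColumnName {
    static String[] parse(String column) {
        int i = column.indexOf(':');
        if (i < 0) {
            throw new IllegalArgumentException("no family prefix: " + column);
        }
        return new String[] { column.substring(0, i), column.substring(i + 1) };
    }

    public static void main(String[] args) {
        String[] parts = parse("fileattr:owning_user");
        System.out.println(parts[0] + " / " + parts[1]); // fileattr / owning_user
    }
}
```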
Conceptual View [figure: sample table showing rows and a column family]
Physical Storage View: Each column family is stored in contiguous chunks over multiple nodes as the data grows.
Example GET:
DecimalFormat decimalFormat = new DecimalFormat("0000000");
HTable htable = new HTable("rest_data");
String str = decimalFormat.format(4);
Get g = new Get(Bytes.toBytes(str));
Result r = htable.get(g);
NavigableMap<byte[], byte[]> map = r.getFamilyMap(Bytes.toBytes("feature"));
Example PUT:
DecimalFormat restIdFormat = new DecimalFormat("0000000");
HTable htable = new HTable("restaurants");
String restId = restIdFormat.format(4);
Put put = new Put(Bytes.toBytes("rest_ids"));
put.add(Bytes.toBytes("restaurant_id"), Bytes.toBytes(restId), Bytes.toBytes(restId));
htable.put(put);
HBase - BigTable. Further reading with many more details: http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture http://labs.google.com/papers/bigtable-osdi06.pdf
MapReduce: Implementations run on the backbone of a DFS such as HDFS or GFS, using, if needed, storage solutions like HBase or BigTable.
Word Count: We have a large file of words, one word to a line. Count the number of times each distinct word appears in the file. Sample application: analyze web server logs to find popular URLs.
Word Count: Input is a set of key/value pairs. The user supplies two functions: map(k, v) → list(k1, v1) (intermediate), and reduce(k1, list(v1)) → v2. (k1, v1) is an intermediate key/value pair. Output is the set of (k1, v2) pairs.
Word Count using MapReduce:
map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)
reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
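The pseudocode above can be exercised in-memory as follows. This is a single-process sketch of map, shuffle, and reduce; real Hadoop code would instead subclass Mapper and Reducer and run across many workers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the word-count pseudocode: map emits (word, 1)
// pairs, a shuffle groups them by key, and reduce sums each group.
public class WordCount {
    static Map<String, Integer> wordCount(String document) {
        // map + shuffle: group the emitted (w, 1) pairs by word
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String w : document.split("\\s+")) {
            groups.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        // reduce: sum the counts for each word
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("to be or not to be"));
        // {be=2, not=1, or=1, to=2}
    }
}
```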
Overview
Data Flow: Input and final output are stored on a distributed file system (GFS, HDFS). The scheduler tries to schedule map tasks close to the physical storage location of the input data. Intermediate results are stored on the local FS of map and reduce workers. Output is often input to another MapReduce task, e.g. the data-mining Apriori algorithm.
Coordination: Master data structures: task status (idle, in-progress, completed). Idle tasks get scheduled as workers become available. When a map task completes, it sends the master the locations and sizes of its R intermediate files, one for each reducer. The master pushes this info to reducers. The master pings workers periodically to detect failures.
Failures: Map worker failure: map tasks completed or in-progress at the worker are reset to idle; reduce workers are notified when a task is rescheduled on another worker. Reduce worker failure: only in-progress tasks are reset to idle. Master failure: the MapReduce task is aborted and the client is notified.
Combiners: Often a map task will produce many pairs of the form (k,v1), (k,v2), ... for the same key k, e.g., popular words in Word Count. Can save network time by pre-aggregating at the mapper: combine(k1, list(v1)) → v2. Usually the same as the reduce function.
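The pre-aggregation above can be sketched for word count, where combine is the same sum as reduce. The point of the example is the shrink in pair count: a word appearing n times on one mapper crosses the network as one pair instead of n.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a combiner: pre-aggregate a mapper's (word, 1) pairs
// locally before they cross the network to the reducers.
public class Combiner {
    // Without a combiner the mapper ships one pair per word occurrence;
    // with one it ships one pair per distinct word.
    static Map<String, Integer> combine(List<String> emittedKeys) {
        Map<String, Integer> local = new HashMap<>();
        for (String k : emittedKeys) local.merge(k, 1, Integer::sum);
        return local;
    }

    public static void main(String[] args) {
        List<String> emitted = List.of("the", "the", "the", "cat");
        Map<String, Integer> combined = combine(emitted);
        // 4 pairs shrink to 2: {the=3, cat=1}
        System.out.println(emitted.size() + " -> " + combined.size());
    }
}
```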
Partition Function: Inputs to map tasks are created by contiguous splits of the input file. For reduce, we need to ensure that records with the same intermediate key end up at the same worker. The system uses a default partition function, e.g., hash(key) mod R. Sometimes it is useful to override it: e.g., hash(hostname(url)) mod R ensures URLs from a host end up in the same output file.
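The custom partition function above can be sketched directly: hash the URL's hostname rather than the whole key, so every URL from the same host lands on the same reducer (and hence in the same output file). The hash choice (Java's String.hashCode) is an illustrative stand-in for whatever hash a real job would use.

```java
import java.net.URI;

// Sketch of the hash(hostname(url)) mod R partition function.
public class HostPartitioner {
    static int partition(String url, int numReducers) {
        String host = URI.create(url).getHost();
        // Math.floorMod keeps the result in [0, numReducers) even when
        // hashCode() is negative.
        return Math.floorMod(host.hashCode(), numReducers);
    }

    public static void main(String[] args) {
        int r1 = partition("http://example.com/a", 10);
        int r2 = partition("http://example.com/b/c", 10);
        System.out.println(r1 == r2); // same host -> same reducer
    }
}
```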
More Reading: http://labs.google.com/papers/mapreduce-osdi04-slides/index.html http://labs.google.com/papers/mapreduce-osdi04.pdf http://wiki.apache.org/hadoop/ http://code.google.com/edu/parallel/mapreduce-tutorial.html