1. Big Data and Scripting: MapReduce in Hadoop
2. Connecting to the last session
- set up a local MapReduce distribution
  - enables execution of MapReduce implementations using the local file system only
  - all tasks are executed sequentially
- setting up the real thing, on each involved node:
  - configure storage for HDFS (hdfs-site.xml)
  - configure the Hadoop environment: set the path to the JRE, put hadoop/bin in the path (hadoop-env.sh)
  - configure ports (core-site.xml)
  - configure passwordless ssh access for the Hadoop user
  - format HDFS: hadoop namenode -format on the master system
  - configure the slave nodes (slaves)
  - start all nodes at once with start-all.sh, stop with stop-all.sh
3. Parts of a Hadoop MapReduce implementation
- the core framework provides customization via individual map and reduce functions, e.g. the implementation in MongoDB
- Hadoop provides customization at all steps:
  - input reading
  - map
  - combine
  - partition
  - shuffle and sort
  - reduce
  - output
4. Input reading
- split up the input file(s) into records of key and value
  - key: usually positional information
  - value: line/part of the file
  - keys have no influence on the distribution to mappers
- individual input readers implement InputFormat
  - used to split up e.g. XML files into records
  - input readers should split, but not parse, the input
- input readers are run on individual blocks, executed on the nodes that store the corresponding blocks
  - records that overlap block boundaries have to be handled manually
5. Combine
- combine pairs emitted on a node before distribution:
  - input data is fed into the mapper
  - the mapper output on each node is fed into the combiner
  - the output of the combiner is distributed to the reducers
- combiners compress the output on a semantic level
  - example word count: reduce three (word,1) pairs to one (word,3) pair
  - one step to reduce network traffic
  - compression of key/value pairs on the data level is implemented by the framework
- combiners implement the Reducer interface
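The word-count example above can be sketched without any Hadoop dependency. This is a plain-Java simulation of what a combiner does to one mapper's output (class and method names are illustrative, not Hadoop API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of a word-count combiner: it merges the (word, 1)
// pairs emitted by one mapper into (word, count) pairs before they are
// sent over the network to the reducers.
class WordCountCombiner {
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> emitted = List.of(
            Map.entry("word", 1), Map.entry("word", 1), Map.entry("word", 1),
            Map.entry("data", 1));
        // three (word,1) pairs become a single (word,3) pair
        System.out.println(combine(emitted).get("word")); // 3
    }
}
```

Because the operation is the same as the reduce step (summing counts), the real combiner can simply reuse the word-count Reducer implementation.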
6. Partition
- decides how key/value pairs are distributed to nodes
- the output of the partitioner is written to HDFS (one file per partition)
- nodes that start reducers pull (download) the files corresponding to their partition
- the default implementation uses hash functions for a uniform random distribution
- a partitioner can influence the distribution directly
  - e.g. ensure that certain pairs end up on the same node
  - their output will then also end up on the same node
- interface: Partitioner maps key/value to a partition (int)
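The default hash-based scheme can be sketched in a few lines. This mirrors the logic of Hadoop's HashPartitioner (the demo class itself is illustrative):

```java
// Sketch of hash-based partitioning: the partition index is derived from
// the key's hash code, so equal keys always land on the same reducer,
// while different keys spread roughly uniformly over the partitions.
class HashPartitionDemo {
    static int getPartition(Object key, int numPartitions) {
        // mask off the sign bit so the result is non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = getPartition("hello", 4);
        int p2 = getPartition("hello", 4);
        // identical keys are always assigned the same partition
        System.out.println(p1 == p2); // true
    }
}
```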
7. Influencing the shuffle/sort
- each reducer node starts a shuffle/sort process
  - downloads the partition files from the mapper nodes
  - sorts key/value pairs by key
- only parameter: a custom Comparator for keys
  - influences order and equality
  - implements an ordering within equivalence classes of keys
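A minimal sketch of such a custom key comparator, using composite string keys of an assumed form "primary:secondary" (the key format and class name are illustrative):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of a custom key comparator as used in the shuffle/sort phase.
// Keys of the form "primary:secondary" are ordered by the primary part
// only; keys that share a primary part compare as equal, so they form
// one equivalence class and reach the reducer as one group.
class PrimaryPartComparator implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        return a.split(":")[0].compareTo(b.split(":")[0]);
    }

    public static void main(String[] args) {
        String[] keys = {"b:2", "a:9", "b:1", "a:3"};
        Arrays.sort(keys, new PrimaryPartComparator());
        // all "a:*" keys now precede all "b:*" keys
        System.out.println(keys[0].charAt(0)); // a
    }
}
```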
8. Output
- analogous to input reading
- started individually for the results on each reducer node
9. Generic information distribution
- transport information in addition to key/value pairs
- examples:
  - small-scale parameters, e.g. the number of clusters for k-means
  - large-scale additional information, e.g. look-up tables
- use JobConf to distribute algorithm parameters
  - set generic parameters using e.g. set(String name, String value)
  - use the configure() implementation of the mapper/reducer/partitioner/combiner to retrieve them before execution
  - each class is provided with the configuration
  - initialize class variables from the general job configuration
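The set-then-retrieve pattern can be sketched without Hadoop. In this plain-Java simulation a Map stands in for the JobConf, and the configure() method plays the role Hadoop's configure(JobConf) hook plays before any record is processed (the parameter name "kmeans.k" and the class are illustrative assumptions):

```java
import java.util.Map;

// Plain-Java sketch of the parameter-distribution pattern: the job
// configuration carries small parameters (here the number of clusters
// for k-means) as strings, and each mapper/reducer instance reads them
// back into typed class variables before processing any records.
class KMeansMapperSketch {
    private int numClusters; // initialized from the job configuration

    void configure(Map<String, String> conf) {
        numClusters = Integer.parseInt(conf.getOrDefault("kmeans.k", "1"));
    }

    int getNumClusters() { return numClusters; }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of("kmeans.k", "5"); // set on the driver side
        KMeansMapperSketch mapper = new KMeansMapperSketch();
        mapper.configure(conf); // the framework calls this before map()
        System.out.println(mapper.getNumClusters()); // 5
    }
}
```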
10. Generic information distribution
- large files should not be transported via JobConf; instead, access files via HDFS
- use the DistributedCache to make local files available on all involved nodes
  - DistributedCache.addCacheArchive(URI uri, Configuration conf): add an archive
  - DistributedCache.addCacheFile(URI uri, Configuration conf): add a file (same parameters)
  - the URI points to an http:// or hdfs:// location
  - files are added by reference to the configuration
  - when tasks are executed on particular nodes, the files are made available in the working directory
  - archives are unpacked
  - send an executable script: new URI("hdfs://host:8020/script.awk#script")
11. Accessing cached files
- Path[] DistributedCache.getLocalCacheArchives(conf)
- Path[] DistributedCache.getLocalCacheFiles(conf)
- retrieve the list of all cached archives/files
- localization is implemented in the framework
12. MapReduce design patterns
- book: MapReduce Design Patterns (Miner, Shook, 2012, O'Reilly)
- design patterns describe mechanisms common to many algorithms
  - solve similar problems in a number of contexts
  - in a sense, generic algorithms
13. Summarization
- general intention:
  - group records by a common field
  - aggregate all values with an identical value in that field
- examples: word count, mean, min, max, counting within groups
- similar to GROUP BY in SQL, but with individual aggregation
- result: a table with one entry (row) per group
  - each row contains the group key and the aggregated value(s)
14. Summarization
- implementation:
  - map each input element to its group, removing all values not needed for aggregation
  - combine group elements by partial aggregation, if possible
  - reduce by aggregating the values and returning group key/(aggregated values)
- the combiner can drastically reduce network traffic
- a custom partitioner can be necessary to resolve a skewed group-size distribution
15. Counting
- counting is a very simple example of MapReduce usage
  - actually, the reduce step is often not necessary
- for a limited number of counters, counting can be implemented map-only
  - this can be achieved with the Reporter approach
  - mappers count occurrences but produce no output pairs
  - no shuffle or reduction necessary
- functions:
  - Reporter.incrCounter(String group, String counter, long amount): create/increase a counter
  - Reporter.getCounter(String group, String counter)
  - versions with Enum addressing available
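A plain-Java simulation of the counter approach (the CounterSketch class mimics the incrCounter/getCounter signatures from the slide; it is not the Hadoop Reporter itself):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of map-only counting: instead of emitting
// (key, value) pairs, the mapper only increments named counters,
// so no shuffle or reduce phase is needed to obtain the totals.
class CounterSketch {
    private final Map<String, Long> counters = new HashMap<>();

    void incrCounter(String group, String counter, long amount) {
        counters.merge(group + "." + counter, amount, Long::sum);
    }

    long getCounter(String group, String counter) {
        return counters.getOrDefault(group + "." + counter, 0L);
    }

    public static void main(String[] args) {
        CounterSketch reporter = new CounterSketch();
        // a map-only job: scan records, increment counters, emit nothing
        for (String line : List.of("error: disk", "ok", "error: net")) {
            if (line.startsWith("error")) reporter.incrCounter("log", "errors", 1);
        }
        System.out.println(reporter.getCounter("log", "errors")); // 2
    }
}
```

In real Hadoop the framework aggregates the per-task counters globally, so the driver can read the totals after the job finishes.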
16. Filtering
- intention:
  - filter input records by some property
  - limit execution to those that pass
- examples: distributed grep, thresholding, data cleansing, random sampling
- as with counting, no reduce is necessary
  - data is read from and written to the local node
- an additional reduce can be used to write the filter result to a single file
  - many small files slow down computations
  - mapping all filtered results to a single reducer allows compacting the result
17. Distinct
- intention:
  - return only the distinct values from a larger set
  - filter out duplicates, or values that are highly similar to each other
- implementation:
  - for each record, extract the field considered in the similarity
  - use it as the key, null as the value
  - the shuffle will transport identical values to one reducer
  - the reducer only stores one copy
- a combiner is extremely useful
- use many reducers:
  - a single reducer does not produce high load
  - every distinct value needs another reducer invocation
  - a single mapper produces many distinct values
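The pattern reduces to "group by the extracted field, emit each key once". A plain-Java simulation (the class is illustrative; a TreeMap stands in for the shuffle's key grouping):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the distinct pattern: each record is mapped to
// (extracted field, null); the shuffle groups identical keys, and the
// reducer emits each key exactly once.
class DistinctSketch {
    static List<String> distinct(List<String> records) {
        Map<String, Object> grouped = new TreeMap<>(); // shuffle: one slot per key
        for (String r : records) {
            grouped.put(r, null); // the value carries no information
        }
        return List.copyOf(grouped.keySet()); // reduce: emit each key once
    }

    public static void main(String[] args) {
        System.out.println(distinct(List.of("b", "a", "b", "a")).size()); // 2
    }
}
```

The combiner helps here because it already deduplicates on the mapper node, so each distinct value crosses the network at most once per mapper.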
18. Structured to hierarchical
- intention: convert a flat format (e.g. an SQL table with value repetitions) into a structured format
- example: given a table storing values for a foreign key
  - each row contains a set of values and the key
  - keys are repeated
  - structure by storing key with its set of values
- implementation:
  - map rows to the foreign key (the higher element in the desired structure)
  - collect all values for each key and store them as structured records
  - can be repeated bottom-up for several layers of structure
- used to convert flat data into structured records such as JSON/XML
19. Partitioning
- intention:
  - partition data by some property (e.g. date)
  - create smaller groups that can be analyzed individually
- example: group log entries by date to allow analysis on a monthly basis
- implementation:
  - use the identity mapper with the partition property as key
  - implement a custom partitioner, creating partitions by key
  - the reducer only writes values to the output
- all data is written to one logical file
  - blocks are distributed as proposed by the partitioner
  - e.g. all blocks for a particular month are on one DataNode
20. Binning
- similar to partitioning, but one file per bin
- implementation:
  - map input values to bins
  - use MultipleOutputs to write the files
  - no combine/shuffle/reduce: job.setNumReduceTasks(0);
  - writing is implemented directly in the map function: write(String namedOutput, K key, V value)
  - MultipleOutputs is configured from the job configuration
- problem: each involved mapper creates one file per bin
  - each bin is distributed over all involved nodes
21. Sorting with the shuffle step
- the implementation of a sort algorithm can exploit the sorting that takes place in the shuffle/sort step
- use the idea presented before:
  - an analyze phase determines the sorting buckets
  - an order phase sorts
- analyze:
  - sample the input and map to sort keys, without values
  - set the number of reducers to one
  - shuffling sorts the keys; the single reducer gets the sorted list of keys
  - create slices of equal size
22. Sorting - order phase
- the mapper maps to the sort key, this time with the values attached
- a custom partitioner loads the partition mapping and applies it to the input values
  - TotalOrderPartitioner provides an implementation of this step
- set the number of reducers to the number of partitions
- the reducers only write the incoming data to files (the order is already correct)
- output is written to part-r-* files (with a number instead of *)
  - the ordering of the files corresponds to the ordering of the values
  - the values within the files are sorted
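Both phases can be simulated in plain Java: the analyze phase derives boundaries from a sorted sample, and the order phase assigns each key to the partition whose range contains it, which is the idea behind TotalOrderPartitioner (the sketch class, integer keys, and boundary rule are illustrative assumptions):

```java
import java.util.Arrays;

// Plain-Java sketch of the two-phase total-order sort: the analyze
// phase samples the keys and derives partition boundaries of roughly
// equal size; the order phase assigns each key to the partition whose
// range contains it, so the sorted partition files concatenate to a
// globally sorted result.
class TotalOrderSketch {
    // analyze phase: pick numPartitions-1 boundaries from the sorted sample
    static int[] boundaries(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] bounds = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) { // equal-size slices of the sample
            bounds[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return bounds;
    }

    // order phase: the custom partitioner applies the boundary mapping
    static int partition(int key, int[] bounds) {
        for (int i = 0; i < bounds.length; i++) {
            if (key < bounds[i]) return i;
        }
        return bounds.length;
    }

    public static void main(String[] args) {
        int[] sample = {5, 1, 9, 3, 7, 2, 8, 4};
        int[] bounds = boundaries(sample, 2);
        // every key in partition 0 is smaller than every key in partition 1
        System.out.println(partition(1, bounds) <= partition(9, bounds)); // true
    }
}
```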
23. Shuffle
- intention: destroy the order/arrangement of the data
- motivation/applications:
  - anonymizing
  - repeatable random sampling
  - distribution of highly accessed parts to multiple nodes
- implementation:
  - map values to a random key
  - shuffling implements the random distribution
  - the reducer only writes
24. Joins
- join as in an SQL JOIN: join two tables by a common key
- implementation:
  - map values to the join key
  - use a partitioner for an even distribution to the reducers
  - the reducer collects the values in temporary lists, using external storage if necessary
  - one list for each source table
  - in the final step, the reducer produces output pairs from the lists
- can be used to implement all types of joins: inner, outer, left, right, anti
- all elements with a common key are processed on a single reducer
  - can be problematic when many values have the same key
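The reduce-side mechanics can be simulated in plain Java: tag each record with its source table, group by join key, then emit the cross product of the per-table lists (class, record names, and the "A"/"B" table tags are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a reduce-side inner join: records from both
// tables are mapped to their join key; for each key the reducer keeps
// one list per source table and finally emits all pairs from the two
// lists. Outer/left/right/anti joins differ only in how empty lists
// are handled.
class ReduceSideJoinSketch {
    record Tagged(String key, String table, String value) {}
    record Lists(List<String> a, List<String> b) {}

    static List<String> innerJoin(List<Tagged> records) {
        Map<String, Lists> byKey = new HashMap<>();
        for (Tagged r : records) { // shuffle: group by join key, one list per table
            Lists lists = byKey.computeIfAbsent(
                r.key(), k -> new Lists(new ArrayList<>(), new ArrayList<>()));
            (r.table().equals("A") ? lists.a() : lists.b()).add(r.value());
        }
        List<String> out = new ArrayList<>(); // reduce: emit pairs from the lists
        for (Lists lists : byKey.values()) {
            for (String a : lists.a())
                for (String b : lists.b()) out.add(a + "|" + b);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tagged> in = List.of(
            new Tagged("k1", "A", "a1"), new Tagged("k1", "B", "b1"),
            new Tagged("k2", "A", "a2")); // k2 has no partner in B
        System.out.println(innerJoin(in).size()); // 1
    }
}
```

A left join would additionally emit entries for keys whose "B" list is empty; an anti join would emit only those.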
25. Replicated join
- join one very large data set with many small data sets
  - the large data set is the left side; output elements correspond to elements of the large data set
  - the small data sets fit into memory
- implementation:
  - the mapper reads the smaller data sets during initialization
  - it processes each record by joining it with elements from the small data sets
  - no combine/shuffle/reduce; the joined data is written directly after mapping
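A plain-Java simulation of the map-side mechanics (in real Hadoop the small table would typically arrive via the DistributedCache from slide 10; the class and method names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a replicated (map-side) join: the small table
// fits into memory and is loaded once during mapper initialization;
// each record of the large table is then joined locally, with no
// shuffle or reduce phase.
class ReplicatedJoinSketch {
    private Map<String, String> smallTable; // loaded once, in initialization

    void initialize(Map<String, String> small) {
        this.smallTable = small;
    }

    // "map": join each large-table record against the in-memory table
    List<String> map(List<String[]> largeRecords) { // each record is [key, value]
        List<String> out = new ArrayList<>();
        for (String[] rec : largeRecords) {
            String match = smallTable.get(rec[0]);
            if (match != null) out.add(rec[1] + "|" + match); // inner-join semantics
        }
        return out;
    }

    public static void main(String[] args) {
        ReplicatedJoinSketch mapper = new ReplicatedJoinSketch();
        mapper.initialize(Map.of("k1", "small1"));
        List<String> joined = mapper.map(List.of(
            new String[]{"k1", "big1"}, new String[]{"k2", "big2"}));
        System.out.println(joined.get(0)); // big1|small1
    }
}
```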
26. Cartesian product
- intention: create all pairs of input values from data sets A and B
- application: pairwise analysis of elements (e.g. determine a distance matrix)
- implementation:
  - create partitions of both data sets: A_1, ..., A_n (n parts) and B_1, ..., B_m (m parts)
  - replicate the values of each partition so that every pair of parts can be formed
  - use n*m reducers
  - each reducer receives all values from one pair of parts (A_i, B_j)
  - the reducer produces all possible pairings
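The work of one of the n*m reducers can be simulated in plain Java (the class is illustrative): given one part of A and one part of B, it emits every pairing; the union over all part pairs then covers all of A x B.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of one reducer in the cartesian-product pattern:
// it receives one pair of parts (A_i, B_j) and emits all pairings of
// their elements; the n*m reducers together cover every pair in A x B.
class CartesianSketch {
    static List<String> pairings(List<String> partA, List<String> partB) {
        List<String> out = new ArrayList<>();
        for (String a : partA) {
            for (String b : partB) out.add(a + "," + b);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = List.of("a1", "a2");
        List<String> b = List.of("b1", "b2", "b3");
        // one reducer emits |A_i| * |B_j| pairs
        System.out.println(pairings(a, b).size()); // 6
    }
}
```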