April Final Quiz COSC


Deirdre Elliott
5 years ago

1. MapReduce Programming

a) Explain briefly the main ideas and components of the MapReduce programming model.

MapReduce is a framework for processing big data in two phases, a Map step and a Reduce step.
- Mapper: processes a key/value pair and generates one or more intermediate key/value pairs
- Reducer: processes an intermediate key and all values associated with that intermediate key
- Other components: Combiner, Writable(s), Input Splitter, Partitioner, Output Committer

b) What is a combiner and why does it help to reduce the execution time of MapReduce applications?

A combiner processes the intermediate key/value pairs of a single mapper, similarly to a reducer. It reduces the amount of data that has to be shuffled and sorted between mapper and reducer.

c) Name two constraints on using a combiner.

- The input and output types of the combiner have to be identical, and they have to match the reducer's input arguments
- The operation performed by the combiner has to be associative and commutative
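The effect of a combiner described in b) can be illustrated with a small word count simulated in plain Python. This is only a sketch under stated assumptions: it does not use the Hadoop API, and the names `mapper`, `combiner`, `reducer` and `run_job` are hypothetical helpers chosen for this example. The second return value counts how many key/value pairs would have to be shuffled.

```python
from collections import defaultdict

def mapper(_key, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word, 1

def combiner(pairs):
    """Combine: pre-sum the counts emitted by a single mapper."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

def reducer(word, counts):
    """Reduce: sum all counts that share one intermediate key."""
    return word, sum(counts)

def run_job(splits, use_combiner=True):
    """Simulate map -> (combine) -> shuffle -> reduce.

    Returns (result, number_of_shuffled_pairs).
    """
    shuffled = []
    for split in splits:  # one simulated mapper per input split
        pairs = [kv for offset, line in enumerate(split)
                 for kv in mapper(offset, line)]
        if use_combiner:
            pairs = combiner(pairs)
        shuffled.extend(pairs)
    groups = defaultdict(list)  # shuffle and sort: group values by key
    for word, count in shuffled:
        groups[word].append(count)
    result = dict(reducer(w, c) for w, c in sorted(groups.items()))
    return result, len(shuffled)

splits = [["a b a", "a c"], ["b b a"]]
with_c, n_with = run_job(splits, use_combiner=True)
without_c, n_without = run_job(splits, use_combiner=False)
print(with_c, n_with, n_without)
```

Summation is associative and commutative and the combiner emits the same (word, count) type that the reducer consumes, so both runs produce the same counts while the combined run shuffles fewer pairs.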

d) Compared to the programming model propagated by MPI, name one advantage and one disadvantage of MapReduce over MPI.

Advantages of MapReduce over MPI:
- No restrictions on data formats and sources
- Good fault tolerance

Disadvantages of MapReduce compared to MPI:
- Restrictive parallel model (e.g. bad for iterative applications, graph problems, etc.)
- Performance

e) What is the advantage of using a specialized system such as Pregel/Giraph over standard MapReduce for graph algorithms?

- Avoids running each iteration as a separate MapReduce job
- Reduces disk traffic by keeping data in memory
- Allows communication to be specified along the edges of the graph
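The vertex-centric style behind the answer to e) can be sketched as follows. This is a toy, single-process illustration and not the Pregel or Giraph API: the function `pregel_max` and the example graph are assumptions made for this sketch. Every vertex keeps its value in memory, messages travel only along edges, and all iterations (supersteps) run inside one loop instead of one job per iteration.

```python
def pregel_max(graph, values):
    """Propagate the largest vertex value to all vertices.

    graph:  dict mapping each vertex to a list of its out-neighbours
    values: dict mapping each vertex to its initial value
    Returns the final values after no more messages are in flight.
    """
    values = dict(values)  # work on a copy; state stays in memory
    # Superstep 0: every vertex sends its value along its outgoing edges.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])
    # Later supersteps: a vertex that learns a larger value adopts it and
    # forwards it; a vertex that learns nothing new sends nothing (halts).
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in graph[v]:
                    outbox[n].append(values[v])
        inbox = outbox
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
initial = {"a": 3, "b": 6, "c": 2, "d": 1}
final = pregel_max(graph, initial)
print(final)
```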

2. Parallel File Systems

a) What is the main goal of a parallel file system and how does that differ from a distributed file system?

The goal of a parallel file system is to improve the bandwidth of I/O operations by using multiple disks simultaneously. Distributed file systems, on the other hand, focus on making a file system reachable from remote nodes/locations.

b) List and explain (briefly) the main components of the HDFS file system.

Namenode
- Manages the file system's namespace/metadata/file blocks
- Runs on one to several machines

Datanode
- Stores and retrieves data blocks
- Reports to the Namenode
- Runs on many machines

Secondary Namenode
- Performs housekeeping work so the Namenode doesn't have to
- Requires similar hardware as the Namenode machine
- Not used for high availability, not a backup for the Namenode

c) Name two usage restrictions for the current implementation of HDFS.

- No overwriting of existing data is possible; data can only be appended to a file
- Only a single process can modify a file
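The bandwidth goal named in a) is commonly achieved by striping a file across several disks. The following is a minimal sketch of generic round-robin striping, assuming fixed-size stripes. It does not describe how HDFS places blocks, and the helper names `stripe` and `read_back` are hypothetical.

```python
def stripe(data: bytes, num_disks: int, stripe_size: int):
    """Distribute fixed-size stripes of `data` round-robin over the disks."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), stripe_size):
        disk_index = (offset // stripe_size) % num_disks
        disks[disk_index].append(data[offset:offset + stripe_size])
    return disks

def read_back(disks):
    """Reassemble the file by visiting the disks in the same order.

    In a real parallel file system the reads of one row of stripes can be
    served by all disks at the same time, which is where the bandwidth
    gain comes from; here they are simply concatenated.
    """
    out = bytearray()
    longest = max(len(d) for d in disks)
    for row in range(longest):
        for d in disks:
            if row < len(d):
                out += d[row]
    return bytes(out)

disks = stripe(b"abcdefghij", num_disks=3, stripe_size=2)
restored = read_back(disks)
print(disks, restored)
```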

3. HBase

a) What are the key components of HBase?

- HBase Master: daemon responsible for managing HBase
- Region Server: manages region files
- Zookeeper: coordination service between clients and the HBase Master
- Region: a range of rows stored together
- Tables: a set of rows and columns organized in regions

b) Name two major differences between a NoSQL database such as HBase and a relational database.

- NoSQL databases often have a dynamic schema (e.g. you can add new columns on the fly)
- Horizontal scaling of NoSQL databases vs. vertical scaling for most SQL databases
- Open source development model vs. mixed (but often commercial) for SQL databases
- Limited support for transactions
- Access through an object-oriented API vs. a query language
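Because a region is a contiguous range of rows, finding the region server for a row key amounts to a search over sorted region start keys. The sketch below illustrates that idea only. It is not the HBase client API, and the region boundaries, the server names `rs1` to `rs3` and the function `locate` are made up for the example.

```python
from bisect import bisect_right

# Hypothetical region table: (start_key, region_server), sorted by start key.
# Each region covers [its start key, next region's start key); the first
# region starts at the empty key, the last one is open-ended.
regions = [("", "rs1"), ("g", "rs2"), ("p", "rs3")]
start_keys = [start for start, _ in regions]

def locate(row_key: str) -> str:
    """Return the region server responsible for `row_key`."""
    index = bisect_right(start_keys, row_key) - 1
    return regions[index][1]

for key in ["apple", "g", "melon", "zebra"]:
    print(key, "->", locate(key))
```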

4. Canopy Clustering in MapReduce

The canopy algorithm is often used as a pre-clustering algorithm to optimize the initial selection of cluster centroids. The algorithm consists of the following steps:

1. Select two distance threshold values T1 and T2 with T1 < T2
2. Select any (random) point from the list to form a cluster center
3. Calculate the distance to all other points in the list
4. Assign all points which fall within the distance threshold T1 to the cluster and remove them from the list of points
5. Select a random point from the list as another cluster center, fulfilling the condition that the distance between the new cluster center and all existing centers is larger than the threshold T2
6. Repeat steps 3-5 until the original list is empty

Assuming that the data points are provided as a text input file with one data point per line, and the MapReduce implementation uses one job for steps 3 and 4, please answer the following questions:

a) What mechanism in Hadoop MapReduce could be used to broadcast the current cluster centers to all mappers? (1 Pt)

Distributed cache

b) In each iteration, a new cluster center is being evaluated. Assuming that a mapper calculates the distance of a given data point (which has not yet been removed from the list) to the new cluster center, what should be the intermediate key/value pair? Justify your answer, if necessary by outlining what the reducers should do with the intermediate key/value pair.

Key: constant string/int
Value: tuple of (data point, distance to center)

The reducer then removes all points whose distance is below the threshold T1 and writes the remaining points into the output file, which is used as the input to the next iteration.
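One iteration of the job for steps 3 and 4, with the key/value layout from answer b), could look roughly like this. It is a plain-Python simulation, not Hadoop code. It assumes comma-separated coordinates per input line and Python 3.8 or later for `math.dist`. The new center is passed as an argument instead of being read from the distributed cache, and the names `mapper`, `reducer` and `run_iteration` are hypothetical.

```python
import math

CONSTANT_KEY = "points"  # one constant key, so a single reducer sees all values

def mapper(line, center):
    """Emit (constant key, (data point, distance to the new center))."""
    point = tuple(float(x) for x in line.split(","))
    return CONSTANT_KEY, (point, math.dist(point, center))

def reducer(_key, values, t1):
    """Drop the points within T1 of the center; keep the rest."""
    return [point for point, distance in values if distance > t1]

def run_iteration(lines, center, t1):
    """Simulate one job; the output lines are the next iteration's input."""
    values = [mapper(line, center)[1] for line in lines]
    remaining = reducer(CONSTANT_KEY, values, t1)
    return [",".join(str(x) for x in point) for point in remaining]

lines = ["0,0", "1,0", "5,5", "6,5"]
after_first = run_iteration(lines, center=(0.0, 0.0), t1=2.0)
after_second = run_iteration(after_first, center=(5.0, 5.0), t1=2.0)
print(after_first, after_second)
```

The center itself has distance 0 and is therefore removed together with its cluster members. Choosing the next center under the T2 condition (step 5) is left to the driver and is not shown here.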
