Big Data and Scripting map reduce in Hadoop


1, Big Data and Scripting map reduce in Hadoop

2, connecting to last session
- set up a local map reduce distribution: enables execution of map reduce implementations using the local file system only; all tasks are executed sequentially
- setting up the real thing, on each involved node:
  - configure storage for HDFS (hdfs-site.xml)
  - configure the Hadoop environment: set the path to the JRE, put hadoop/bin in the path (hadoop-env.sh)
  - configure ports (core-site.xml)
  - configure ssh access without password for the Hadoop user
- format HDFS with hadoop namenode -format on the master system
- configure the slaves (slaves file)
- start all nodes at once with start-all.sh, stop with stop-all.sh

3, parts of a Hadoop map/reduce implementation
- the core framework provides customization via individual map and reduce functions, e.g. the implementation in mongodb
- Hadoop provides customization at all steps: input reading, map, combine, partition, shuffle and sort, reduce, output

4, input reading
- split up input file(s) into records of key and value
  - key: usually positional information
  - value: a line/part of the file
  - keys have no influence on the distribution to mappers
- individual input readers implement InputFormat
  - used to split up e.g. XML files into records
  - input readers should split, but not parse the input
- input readers are run on individual blocks and executed on the nodes that store the corresponding blocks
- records that could overlap blocks have to be handled manually (see the sketch below)
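One simple way around the overlap problem, sketched here under the old org.apache.hadoop.mapred API that these slides follow, is to disable splitting altogether, so each file is read by a single mapper and no record can be cut at a block boundary; the class name is made up, and the price is lost parallelism and locality for large files.

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;

  // illustrative sketch: never split a file, so records that would span
  // two HDFS blocks are always read in one piece by one mapper
  public class UnsplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false; // one split per file
    }
  }

Set it with conf.setInputFormat(UnsplittableTextInputFormat.class).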

5, combine
- combines pairs emitted on a node before distribution:
  - input data is fed into the mapper
  - the mapper output on each node is fed into the combiner
  - the output of the combiner is distributed to the reducers
- combiners compress output on a semantic level, example word count: reduce 3 x (word,1) to (word,3)
  - one step to reduce network traffic
  - compression of key/value pairs on the data level is implemented by the framework
- combiners implement the Reducer interface (see the word count sketch below)
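For word count the aggregation is associative and commutative, so one class can serve as both combiner and reducer; a minimal sketch in the old mapred API (class name illustrative):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // sums the counts for one word; usable as combiner (locally, per node)
  // and as reducer (globally, after the shuffle)
  public class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
        OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (counts.hasNext())
        sum += counts.next().get();
      out.collect(word, new IntWritable(sum)); // 3 x (word,1) -> (word,3)
    }
  }

Wired up in the driver with conf.setCombinerClass(SumReducer.class) and conf.setReducerClass(SumReducer.class).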

6, partition
- decides how key/value pairs are distributed to nodes
- the partitioned map output is written to the local disk of the mapper node (one file region per partition); nodes that start reducers pull (download) the parts corresponding to their partition
- the default implementation uses a hash function for a random, uniform distribution
- a partitioner can influence the distribution directly, e.g. ensure that certain pairs end up on the same node; the output for these will then also end up on the same node
- interface: Partitioner, maps key/value to a partition (int), see the sketch below
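A minimal custom partitioner under the old mapred API; the routing rule (first character of the key) and the class name are made up for this sketch:

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // routes all keys sharing a first character to the same partition,
  // so they are processed by the same reducer
  public class FirstCharPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}

    public int getPartition(Text key, Text value, int numPartitions) {
      String s = key.toString();
      int c = s.isEmpty() ? 0 : s.charAt(0);
      return (c & Integer.MAX_VALUE) % numPartitions; // non-negative bucket
    }
  }

Registered with conf.setPartitionerClass(FirstCharPartitioner.class).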

7, influencing the shuffle/sort
- each reducer node starts a shuffle/sort process:
  - downloads the partition files from the mapper nodes
  - sorts the key/value pairs by key
- only parameter: a custom Comparator for the keys
  - influences order and equality
  - implements the ordering within equivalence classes of keys (sketch below)
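A sketch of such a comparator, here simply inverting the natural order of Text keys (class name illustrative); it would be registered with conf.setOutputKeyComparatorClass(DescendingTextComparator.class):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;

  // sorts Text keys in descending order by inverting the default comparison
  public class DescendingTextComparator extends WritableComparator {
    public DescendingTextComparator() {
      super(Text.class, true); // true: deserialize keys for object comparison
    }

    @Override
    @SuppressWarnings("unchecked")
    public int compare(WritableComparable a, WritableComparable b) {
      return -a.compareTo(b); // invert the natural (ascending) order
    }
  }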

8, output
- analogous to input reading
- started individually for the results on each reducer node

9, generic information distribution
- transport information in addition to the key/value pairs
- examples:
  - small-scale parameters, e.g. the number of clusters for k-means
  - large-scale additional information, e.g. look-up tables
- use JobConf to distribute algorithm parameters
  - set generic parameters using e.g. set(String name, String value)
  - use the configure() implementation of mapper/reducer/partitioner/combiner to retrieve them before execution
  - each class is provided with the configuration; initialize class variables from the general job configuration (sketch below)
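A sketch of the parameter round trip; the parameter name "kmeans.k" and the class name are made up:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // driver side: conf.setInt("kmeans.k", 10);
  public class ParameterizedMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {
    private int k;

    @Override
    public void configure(JobConf job) {
      // runs once per task, before the first map() call
      k = job.getInt("kmeans.k", 1);
    }

    public void map(LongWritable key, Text value,
        OutputCollector<Text, NullWritable> out, Reporter reporter)
        throws IOException {
      // ... use k, e.g. assign the record to one of k clusters ...
    }
  }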

10, generic information distribution
- large files should not be transported via JobConf; instead, access files via HDFS
- use the DistributedCache to make local files available on all involved nodes
  - DistributedCache.addCacheArchive(URI uri, Configuration conf) adds an archive
  - DistributedCache.addCacheFile() adds a file (same parameters)
  - the uri points to an http:// or hdfs:// location
  - files are added by reference to the configuration
- when tasks are executed on particular nodes, the files are made available in the working directory; archives are unpacked
- send an executable script: new URI("hdfs://host:8020/script.awk#script")

11, accessing cached files
- Path[] DistributedCache.getLocalCacheArchives(conf)
- Path[] DistributedCache.getLocalCacheFiles(conf)
- retrieve the list of all cached archives/files
- the localization is implemented in the framework (sketch below)
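Both halves together, as a sketch (the HDFS path is illustrative):

  import java.io.IOException;
  import java.net.URI;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;

  // driver side: register a look-up table stored in HDFS
  //   DistributedCache.addCacheFile(
  //       new URI("hdfs://host:8020/data/lookup.txt"), conf);

  // task side: the framework has already copied the file to the local node
  public class CacheAwareTask extends MapReduceBase {
    @Override
    public void configure(JobConf job) {
      try {
        Path[] cached = DistributedCache.getLocalCacheFiles(job);
        if (cached != null) {
          for (Path p : cached) {
            // open p as a local file and load the look-up table
          }
        }
      } catch (IOException e) {
        throw new RuntimeException("could not localize cache files", e);
      }
    }
  }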

12, map reduce design patterns
- book: MapReduce Design Patterns (Miner, Shook, 2012, O'Reilly)
- design patterns describe mechanisms common to many algorithms
- they solve similar problems in a number of contexts; in a sense, generic algorithms

13, Summarization
- general intention: group records by a common field, aggregate all values with an identical value in that field
- examples: word count, mean, min, max, counting within groups
- similar to GROUP BY in SQL, but with individual aggregation
- result: a table with one entry (row) per group; each row contains the group key and the aggregated value(s)

14, Summarization
- implementation:
  - map each input element to its group, remove all values not needed for the aggregation
  - combine group elements by partial aggregation, if possible
  - reduce by aggregating the values and returning group key/(aggregated values), see the sketch below
- the combiner can drastically reduce network traffic
- a custom partitioner can be necessary to resolve a skewed group-size distribution
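A sketch of a summarization reducer computing the minimum and maximum per group (class name illustrative; a combiner for min/max would have to emit its partial results in the mapper's value type and is omitted here):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // one output row per group: key, minimum, maximum
  public class MinMaxReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, Text> {
    public void reduce(Text group, Iterator<LongWritable> values,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
      while (values.hasNext()) {
        long v = values.next().get();
        if (v < min) min = v;
        if (v > max) max = v;
      }
      out.collect(group, new Text(min + "\t" + max));
    }
  }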

15, Counting
- counting is a very simple example of map reduce usage; the reduce step is often not necessary
- for a limited number of counters, counting can be implemented map-only
- this is achieved with the Reporter approach: mappers count occurrences and produce no output pairs, so no shuffle or reduction is necessary (sketch below)
- functions:
  - Reporter.incrCounter(String group, String counter, long amount) creates/increases a counter
  - Reporter.getCounter(String group, String counter)
  - versions with Enum addressing are available
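A map-only counting sketch; the counter group/name and the matched string are made up:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // emits no output pairs; the counts travel through the counter mechanism
  public class ErrorCountMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, NullWritable> {
    public void map(LongWritable key, Text line,
        OutputCollector<NullWritable, NullWritable> out, Reporter reporter)
        throws IOException {
      if (line.toString().contains("ERROR"))
        reporter.incrCounter("log", "errors", 1);
    }
  }

With conf.setNumReduceTasks(0) in the driver; the final counter values can be read from the RunningJob returned by JobClient.runJob(conf).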

16, Filtering
- intention: filter input records by some property, limit execution to those that pass
- examples: distributed grep, thresholding, data cleansing, random sampling
- as with counting, no reduce is necessary; data is read from and written to the local node (sketch below)
- an additional reduce can be used to write the filter result to a single file:
  - many small files slow down computations
  - mapping all filtered results to a single reducer compacts the result
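A distributed-grep sketch; the parameter name "grep.pattern" is made up:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // passes a record through only if it matches the configured pattern
  public class GrepMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, Text> {
    private String pattern;

    @Override
    public void configure(JobConf job) {
      pattern = job.get("grep.pattern", "");
    }

    public void map(LongWritable key, Text line,
        OutputCollector<NullWritable, Text> out, Reporter reporter)
        throws IOException {
      if (line.toString().contains(pattern))
        out.collect(NullWritable.get(), line); // record passes the filter
    }
  }

Map-only with conf.setNumReduceTasks(0); setting it to 1 instead funnels all filtered records into a single output file.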

17, Distinct
- intention: return only the distinct values from a larger set; filter out duplicates, or values that are highly similar to each other
- implementation:
  - for each record, extract the field considered for similarity
  - use it as key, null as value
  - the shuffle transports identical values to one reducer
  - the reducer only stores one copy (sketch below)
- a combiner is extremely useful here, since a single mapper already produces many duplicate values
- use many reducers: each reduce call produces little load, but every distinct value triggers another reduce invocation
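A minimal distinct sketch (class names illustrative); the reducer also works as combiner since its input and output types match the map output:

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // the value of interest becomes the key; the shuffle groups duplicates
  public class DistinctMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, NullWritable> out, Reporter reporter)
        throws IOException {
      out.collect(value, NullWritable.get());
    }
  }

  class DistinctReducer extends MapReduceBase
      implements Reducer<Text, NullWritable, Text, NullWritable> {
    public void reduce(Text key, Iterator<NullWritable> values,
        OutputCollector<Text, NullWritable> out, Reporter reporter)
        throws IOException {
      out.collect(key, NullWritable.get()); // one copy per distinct value
    }
  }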

18, structured to hierarchical
- intention: convert a flat format (e.g. an SQL table with value repetitions) into a structured format
- example: given a table storing values for a foreign key
  - each row contains a set of values and the key; keys are repeated
  - structure the data by storing key -> set of values
- implementation:
  - map each row to the foreign key (the higher element in the desired structure)
  - collect all values for each key and store them as structured records
- can be repeated bottom-up for several layers of structure
- used to convert flat data into structured records such as JSON/XML

19, partitioning
- intention: partition data by some property (e.g. date), creating smaller groups that can be analyzed individually
- example: group log entries by date to allow analysis on a monthly basis
- implementation:
  - use the identity mapper with the partition property as key
  - implement a custom partitioner that creates the partitions by key (see the sketch at slide 6)
  - the reducer only writes the values to the output
- all data is written to one logical file; the blocks are distributed as proposed by the partitioner, e.g. all blocks for a particular month are on one DataNode

20, binning
- similar to partitioning, but with one file per bin
- implementation:
  - map input values to bins
  - use MultipleOutputs to write the files (sketch below)
  - no combine/shuffle/reduce: job.setNumReduceTasks(0);
  - writing happens directly in the map function: write(String namedOutput, K key, V value)
  - MultipleOutputs is configured from the job configuration
- problem: each involved mapper creates one file per bin, so each bin is distributed over all involved nodes
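The write(...) signature quoted on the slide belongs to the newer org.apache.hadoop.mapreduce API; the old mapred API used elsewhere in these notes goes through getCollector(...) instead. A binning sketch in the old API, with made-up bin names and a made-up bin rule:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.mapred.lib.MultipleOutputs;

  // driver side: declare the bins and make the job map-only
  //   MultipleOutputs.addNamedOutput(conf, "small", TextOutputFormat.class,
  //       NullWritable.class, Text.class);
  //   MultipleOutputs.addNamedOutput(conf, "large", TextOutputFormat.class,
  //       NullWritable.class, Text.class);
  //   conf.setNumReduceTasks(0);
  public class BinningMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs mos;

    @Override
    public void configure(JobConf job) {
      mos = new MultipleOutputs(job);
    }

    @SuppressWarnings("unchecked")
    public void map(LongWritable key, Text line,
        OutputCollector<NullWritable, Text> out, Reporter reporter)
        throws IOException {
      String bin = line.getLength() < 80 ? "small" : "large"; // made-up rule
      mos.getCollector(bin, reporter).collect(NullWritable.get(), line);
    }

    @Override
    public void close() throws IOException {
      mos.close(); // flushes all per-bin writers
    }
  }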

21, sorting with the shuffle step
- an implementation of a sort algorithm can exploit the sorting that takes place in the shuffle step
- use the idea presented before: an analyze phase determines the sorting buckets, an order phase sorts
- analyze phase:
  - sample the input and map it to sort keys, without values
  - set the number of reducers to one
  - the shuffle sorts the keys; the single reducer gets the sorted list of keys and creates slices of equal size

22, sorting - order phase
- the mapper maps to the sort key, this time with the values attached
- a custom partitioner loads the partition mapping and applies it to the input values
  - TotalOrderPartitioner provides an implementation for this step (sketch below)
- set the number of reducers to the number of partitions
- the reducer only writes the incoming data to file (the order is already correct)
- output is written to part-r-* files (with a number in place of *)
  - the ordering of the files corresponds to the ordering of the values
  - the values within each file are sorted
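A driver sketch wiring both phases together with the stock sampler; paths, key types, and sampler parameters are made up:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.InputSampler;
  import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

  public class TotalSortDriver {
    public static void configureSort(JobConf conf) throws Exception {
      conf.setNumReduceTasks(4); // number of partitions = number of reducers
      conf.setPartitionerClass(TotalOrderPartitioner.class);

      // the analyze phase writes the bucket boundaries to a partition file
      Path partitionFile = new Path("/tmp/sort-partitions.lst");
      TotalOrderPartitioner.setPartitionFile(conf, partitionFile);

      // sample ~10% of the records, at most 10000 samples from 10 splits
      InputSampler.Sampler<Text, NullWritable> sampler =
          new InputSampler.RandomSampler<Text, NullWritable>(0.1, 10000, 10);
      InputSampler.writePartitionFile(conf, sampler);
    }
  }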

23, Shuffle
- intention: destroy the order of the data arrangement
- motivation/applications:
  - anonymizing
  - repeatable random sampling
  - distribution of highly accessed parts to multiple nodes
- implementation:
  - map values to a random key (sketch below)
  - the shuffle implements the random distribution
  - the reducer only writes
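A sketch of the random-key mapper (class name illustrative); for repeatable sampling the Random would be seeded deterministically, e.g. from a job parameter:

  import java.io.IOException;
  import java.util.Random;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // a random key destroys any input order; the shuffle scatters the records
  public class RandomKeyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, IntWritable, Text> {
    private final Random random = new Random();

    public void map(LongWritable key, Text value,
        OutputCollector<IntWritable, Text> out, Reporter reporter)
        throws IOException {
      out.collect(new IntWritable(random.nextInt(Integer.MAX_VALUE)), value);
    }
  }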

24, Joins
- join as in SQL JOIN: join two tables by a common key
- implementation:
  - map the values to the join key
  - use a partitioner for an even distribution to the reducers
  - the reducer collects the values in temporary lists, using external storage if necessary, one list per source table
  - in a final step, the reducer produces the output pairs from the lists (sketch below)
- can be used to implement all types of joins: inner, outer, left, right, anti
- all elements with a common key are processed on a single reducer, which can be problematic when many values share the same key
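A reduce-side inner join sketch; the tagging convention ("A"/"B" prefixes, added by the mappers, which are not shown) is made up:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // mappers emit (joinKey, "A\t<record>") or (joinKey, "B\t<record>")
  public class JoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text joinKey, Iterator<Text> tagged,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      List<String> left = new ArrayList<String>();  // records from table A
      List<String> right = new ArrayList<String>(); // records from table B
      while (tagged.hasNext()) {
        String v = tagged.next().toString();
        if (v.startsWith("A\t")) left.add(v.substring(2));
        else right.add(v.substring(2));
      }
      // inner join: cross product of both lists for this key
      for (String l : left)
        for (String r : right)
          out.collect(joinKey, new Text(l + "\t" + r));
    }
  }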

25, replicated join
- join a very large data set with many small data sets
  - the large data set is the left side; the output elements correspond to the elements of the large data set
  - the small data sets fit into memory
- implementation:
  - the mapper reads the smaller data sets during initialization
  - it processes each record by joining it with the elements from the small data sets (sketch below)
  - no combine/shuffle/reduce; the joined data is written directly after mapping
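A replicated-join sketch combining this pattern with the DistributedCache from slide 10; the tab-separated file layout and the class name are assumptions:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // the small table is shipped via DistributedCache, loaded once per task,
  // and probed for every record of the large data set
  public class ReplicatedJoinMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> small = new HashMap<String, String>();

    @Override
    public void configure(JobConf job) {
      try {
        Path[] cached = DistributedCache.getLocalCacheFiles(job);
        BufferedReader in =
            new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = in.readLine()) != null) {
          String[] kv = line.split("\t", 2); // assumed key<TAB>value layout
          if (kv.length == 2) small.put(kv[0], kv[1]);
        }
        in.close();
      } catch (IOException e) {
        throw new RuntimeException("failed to load replicated table", e);
      }
    }

    public void map(LongWritable offset, Text record,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] kv = record.toString().split("\t", 2);
      if (kv.length < 2) return;
      String match = small.get(kv[0]);
      if (match != null) // inner join; emit unmatched rows for a left join
        out.collect(new Text(kv[0]), new Text(kv[1] + "\t" + match));
    }
  }

Map-only again: conf.setNumReduceTasks(0).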

26, cartesian product
- intention: create all pairs of input values from data sets A and B
- application: pairwise analysis of elements, e.g. determining a distance matrix
- implementation:
  - create partitions of both data sets: A_1, ..., A_n (n parts) and B_1, ..., B_m (m parts)
  - replicate the values of each partition: A_1^1, ..., A_1^m, ...
  - use n*m reducers; each reducer receives all values from one pair of parts (A_i, B_j)
  - the reducer produces all possible pairings