Big Data Analysis using Hadoop Lecture 3
- Phyllis Morrison
Last Week - Recap
- Driver Class
- Mapper Class
- Reducer Class
- Created our first MR process
- Ran it on Hadoop
- Monitored it on the web pages
- Checked outputs using the HDFS command line and web pages
In this Class
- Counters
- Combiners
- Partitioners
- Reading and Writing Data
- Chaining MapReduce Jobs (Workflows)
- Lab work
- Assignment

Counters
Passing information back to the driver
- Counters provide a way for Mappers or Reducers to pass aggregated values back to the driver after the job has completed.
- The framework provides built-in counters:
  - Map-Reduce counters - e.g. number of input and output records for mappers and reducers, time and memory statistics
  - File System counters - e.g. number of bytes read or written
  - Job counters - e.g. launched tasks, failed tasks, etc.
- Counters are visible from the JobTracker UI
- Counters are reported on the console when the job finishes

User Defined Counters
- User-defined counters are a useful mechanism for gathering statistics about the job, e.g.
  - quality control - tracking different types of input records, e.g. bad input records
  - instrumentation of code - number of warnings, number of errors, ...
- The framework aggregates all user-defined counters over all mappers and reducers and reports them back on the UI
- Counters are set up as enums in the Mapper or Reducer:

    enum GroupName { Counter1, Counter2, ... }

- Counters are retrieved from the Context object which is passed to the Mapper and Reducer:

    context.getCounter(Enum<?> counterName);

- Increment or set the counter value:

    increment(long incrementAmt);
    setValue(long value);
Using Counters
Set up the user-defined counters:

    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        static enum MyCounters { Bad, Good, Missing }

        public void map(LongWritable key, Text value, Context context) ...

Increment the counter when necessary:

    ...
    if (<input data problem>) {
        context.getCounter(MyCounters.Bad).increment(1);
    }
    ...

Using Counters
Print out the counters at the end of the job in the Driver:

    ...
    int exitCode = job.waitForCompletion(true) ? 0 : 1;
    System.out.println("Job is complete - printing counters now:");
    Counters counters = job.getCounters();
    Counter bad = counters.findCounter(WordMapper.MyCounters.Bad);
    System.out.println("Number of bad records is " + bad.getValue());
    ...
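The aggregation the framework performs can be mimicked in plain Java. The sketch below is a hypothetical illustration, not the Hadoop API: each "task" keeps its own enum-keyed tallies, and the driver sees the sum over all tasks after the job finishes.

```java
import java.util.EnumMap;

// Hypothetical illustration (not the Hadoop Counter API): mimicking how the
// framework aggregates per-task enum counters into job-wide totals.
class CounterDemo {
    enum MyCounters { BAD, GOOD, MISSING }

    // Each "task" (mapper or reducer) keeps its own local tallies.
    static EnumMap<MyCounters, Long> newTaskCounters() {
        EnumMap<MyCounters, Long> c = new EnumMap<>(MyCounters.class);
        for (MyCounters k : MyCounters.values()) c.put(k, 0L);
        return c;
    }

    static void increment(EnumMap<MyCounters, Long> c, MyCounters k, long amt) {
        c.put(k, c.get(k) + amt);
    }

    // The driver sees the sum over all tasks, as reported when the job ends.
    @SafeVarargs
    static long aggregate(MyCounters k, EnumMap<MyCounters, Long>... tasks) {
        long total = 0;
        for (EnumMap<MyCounters, Long> t : tasks) total += t.get(k);
        return total;
    }

    public static void main(String[] args) {
        EnumMap<MyCounters, Long> mapper1 = newTaskCounters();
        EnumMap<MyCounters, Long> mapper2 = newTaskCounters();
        increment(mapper1, MyCounters.BAD, 1);
        increment(mapper2, MyCounters.BAD, 1);
        System.out.println("Bad records: " + aggregate(MyCounters.BAD, mapper1, mapper2));
    }
}
```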
Combiners
- Mappers can produce a large amount of intermediate data, generating significant network traffic when it is passed to the Reducers.
- A Combiner is a mini-reducer:
  - runs locally on a single Mapper's output
  - passes its output to the Reducer
  - reduces the intermediate data passed to the Reducer
- Can lead to faster jobs and less network traffic
  - often reduces the amount of work needed to be done by the Reducer
  - may be the same code as the Reducer
Typical MR process / MR process with Combiner

WordCount Example
Input to the Mapper:

    (124, "this one I think is called a yink")
    (158, "he likes to wink, he likes to drink")
    (195, "he likes to drink and drink and drink")

Output from the Mapper:

    (this,1) (one,1) (I,1) (think,1) (is,1) (called,1) (a,1) (yink,1)
    (he,1) (likes,1) (to,1) (wink,1) (he,1) (likes,1) (to,1) (drink,1)
    (he,1) (likes,1) (to,1) (drink,1) (and,1) (drink,1) (and,1) (drink,1)
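The mapper's tokenise-and-emit logic above can be sketched in plain Java (this is not the Hadoop Mapper class, just the per-line logic it would contain):

```java
import java.util.*;

// Plain-Java sketch of the WordCount mapper logic: each input line is
// tokenised and a (word, 1) pair is emitted for every token.
class MapperDemo {
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("[^a-z]+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("he likes to drink and drink and drink"));
    }
}
```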
Example Job - What happens with only a Reducer
Intermediate data sent to the Reducer:

    (a,[1]) (and,[1,1]) (called,[1]) (drink,[1,1,1,1]) (he,[1,1,1]) (I,[1]) (is,[1]) (likes,[1,1,1]) (one,[1]) (think,[1]) (this,[1]) (to,[1,1,1]) (wink,[1]) (yink,[1])

Reducer output:

    (a,1) (and,2) (called,1) (drink,4) (he,3) (I,1) (is,1) (likes,3) (one,1) (think,1) (this,1) (to,3) (wink,1) (yink,1)

When we use a Combiner
Output from the Mapper, grouped locally:

    (a,[1]) (and,[1,1]) (called,[1]) (drink,[1,1,1,1]) (he,[1,1,1]) (I,[1]) (is,[1]) (likes,[1,1,1]) (one,[1]) (think,[1]) (this,[1]) (to,[1,1,1]) (wink,[1]) (yink,[1])

Combiner output:

    (a,[1]) (and,[2]) (called,[1]) (drink,[4]) (he,[3]) (I,[1]) (is,[1]) (likes,[3]) (one,[1]) (think,[1]) (this,[1]) (to,[3]) (wink,[1]) (yink,[1])
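The shuffle & sort grouping and the reduce summation shown above can be simulated in plain Java (a hypothetical sketch, not the framework itself): pairs are grouped by key in sorted order, then each group's values are summed.

```java
import java.util.*;

// Plain-Java sketch of shuffle & sort plus reduce: (word, 1) pairs are
// grouped by key (keys kept sorted, as the framework does), then summed.
class ShuffleReduceDemo {
    // Shuffle & sort: group values by key, keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<String[]> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] p : pairs) {
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add(Integer.parseInt(p[1]));
        }
        return grouped;
    }

    // Reduce: sum the list of counts for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        grouped.forEach((k, v) -> out.put(k, v.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        for (String w : "he likes to drink and drink".split(" ")) pairs.add(new String[]{w, "1"});
        System.out.println(reduce(shuffle(pairs)));
    }
}
```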
When we use a Combiner
Intermediate data sent to the Reducer (the Combiner output):

    (a,[1]) (and,[2]) (called,[1]) (drink,[4]) (he,[3]) (I,[1]) (is,[1]) (likes,[3]) (one,[1]) (think,[1]) (this,[1]) (to,[3]) (wink,[1]) (yink,[1])

Reducer output:

    (a,1) (and,2) (called,1) (drink,4) (he,3) (I,1) (is,1) (likes,3) (one,1) (think,1) (this,1) (to,3) (wink,1) (yink,1)

Other Combiner outputs are merged in the same way.

Specifying a Combiner
- Set the Combiner up in the job configuration in the Driver:

    job.setCombinerClass(YourCombinerClass.class);

- The Combiner uses the same interface as the Reducer:
  - takes in a key and a list of values (output from the Mapper)
  - outputs zero or more key/value pairs
  - the work is done in the reduce() method
- Note:
  - The Combiner input types must be the same as the Mapper output types (K2, V2)
  - The Combiner output types must be the same as the Reducer input types (K2, V2)
  - Don't put code in the Combiner that alters the data from the Mapper
Example Code
- The Reducer code is used for the Combiner
- Sometimes the code can be slightly different between the Combiner and the Reducer
- You need to be careful to maintain the input and output formats
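The warning about not altering the data in the Combiner can be made concrete: a Combiner only works when the operation gives the same answer applied per-mapper first and then globally (as summation does). The plain-Java sketch below (a hypothetical illustration, not Hadoop code) shows that summing commutes with combining, while averaging does not.

```java
import java.util.*;

// Plain-Java sketch of why a combiner must not change the meaning of the
// mapper output: combining partial sums gives the same total, but
// combining partial means gives a different (wrong) final average.
class CombinerSafetyDemo {
    static int sum(List<Integer> xs) {
        int s = 0;
        for (int x : xs) s += x;
        return s;
    }

    static double mean(List<Double> xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.size();
    }

    public static void main(String[] args) {
        // Mean of per-"mapper" means: mean(mean(1,2,3), mean(4)) = 3.0,
        // but the global mean of (1,2,3,4) is 2.5.
        System.out.println(mean(Arrays.asList(mean(Arrays.asList(1.0, 2.0, 3.0)),
                                              mean(Arrays.asList(4.0)))));
        System.out.println(mean(Arrays.asList(1.0, 2.0, 3.0, 4.0)));
    }
}
```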
Partitioners

Partitioners
- The number of Reducers that run is specified in the job configuration; the default number is 1
- The number of Reducers can be set when setting up the job with setNumReduceTasks(value):

    job.setNumReduceTasks(10);

- No need to set this value if you just want one Reducer
- The Partitioner implementation directs key-value pairs to a specific reducer
- Number of Partitions = Number of Reducers
- The default is to hash the key to determine the partition, implemented by HashPartitioner<K,V>
Partitioners - Shuffle & Sort
[Diagram: shapes represent keys, inner patterns represent values]
Note:
- All records with the same key go to the same reducer
- A reducer can handle different keys
- Reducers can have different loads
Partitioners
- Partitioners determine which reducer each map output record is sent to in the shuffle & sort phase
- This is normally determined using a hash function on the key
- It is important that the Partitioner distributes the map output records evenly over the reducers

    job.setPartitionerClass(CasePartitioner.class);
    job.setNumReduceTasks(10);

Default Partitioner
- The default Partitioner is the HashPartitioner
- The key's hash code is turned into a non-negative integer by bitwise ANDing it with the largest integer value
- It is then reduced modulo the number of partitions to find the index of the partition (the reducer number) that the record belongs to

    public int getPartition(K2 key, V2 value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

- Keys are distributed evenly across the available reduce tasks, assuming a good hashCode() function
- Records with the same key will end up in the same reduce task
Default Partitioner (example with 3 reducers)

    Key      Hashcode   Modulo 3   Goes to
    This     1          1          Reducer 1
    is       2          2          Reducer 2
    not      3          0          Reducer 0
    my       4          1          Reducer 1
    office   5          2          Reducer 2
    Colour   6          0          Reducer 0
    Pen      7          1          Reducer 1
    Money    8          2          Reducer 2

To Implement a Custom Partitioner
- Set the number of reducers in the job configuration
- Create a custom Partitioner class by extending the Partitioner class
- Implement the getPartition() method to return a number between 0 and the number of reducers - 1, indexing which reducer the key/value pair should be sent to

    public class MyPartitioner extends Partitioner<KEY, VALUE> {
        public int getPartition(KEY key, VALUE value, int numPartitions) {
            // put code here to decide, based on the key, which
            // reducer the map output should go to
            ...
            return partitionNumber;
        }
    }
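The HashPartitioner arithmetic can be checked in plain Java (the real class lives in Hadoop; this sketch just reproduces the one-line formula for String keys):

```java
// Plain-Java sketch of the HashPartitioner formula shown above:
// (hashCode & Integer.MAX_VALUE) % numPartitions always yields a valid,
// deterministic reducer index, even for keys whose hash code is negative.
class HashPartitionDemo {
    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        for (String k : new String[]{"This", "is", "not", "my", "office"}) {
            System.out.println(k + " -> reducer " + getPartition(k, 3));
        }
    }
}
```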
Simple MapReduce Job / Complete MapReduce Job
[Diagrams: a simple MapReduce job, and a complete MapReduce job with Combiner and Partitioner]
Drawbacks
- Need to know the number of partitions/reducers at the start; it is not dynamic
- Letting the application fix the number of reducers, rather than the cluster, can result in inefficient use of the cluster and uneven reduce tasks that can dominate the job execution time
- See MultipleOutputs later on for an alternative solution

Partitioner example code
Reading & Writing Data
[Diagram: File → HDFS blocks (physical locations) → InputSplits → Mappers → map]
- Logical splits are created by an InputFormat
- Each split is processed by a single mapper - data locality
- Each record (key/value pair) is processed by the map method

Input Data
- The input data is split into chunks called input splits (a logical division)
  - the split size is normally the size of a block, but it is configurable
- The size of the splits determines the level of parallelisation:
  - one input split per mapper
  - all input data in one single split means no parallelisation
  - small input splits are useful for parallelisation of CPU-bound tasks
- HDFS stores input data in blocks spread across the nodes (a physical division)
  - one block per input split is very efficient for I/O-bound tasks
- An input split may span blocks - Hadoop guarantees the processing of all records, but data-local mappers will occasionally need to access remote data
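The relationship between file size, split size, and the number of mappers follows from simple ceiling division. A plain-Java sketch (the sizes below are illustrative assumptions, not defaults pulled from a real cluster):

```java
// Plain-Java sketch: the number of input splits (and hence mappers) that a
// file produces when the split size equals the block size.
class SplitDemo {
    static long numSplits(long fileBytes, long splitBytes) {
        return (fileBytes + splitBytes - 1) / splitBytes;   // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // e.g. a 300 MB file with 128 MB splits needs 3 mappers
        System.out.println(numSplits(300 * mb, 128 * mb));
    }
}
```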
Input Data
- Data input is supported by:
  - InputFormat: indicates how the input files should be split into input splits
  - Reader: performs the reading of the data, providing a key/value pair for input to the mapper
- Hadoop provides predefined InputFormats:
  - TextInputFormat: the default input format
  - KeyValueTextInputFormat
  - SequenceFileInputFormat (normally used for chaining multiple MapReduce jobs)
  - NLineInputFormat
- The input format can be set in the job configuration in your driver, e.g.

    job.setInputFormatClass(KeyValueTextInputFormat.class);

- To read input data in a way not supported by the standard InputFormat classes you can create a custom InputFormat
Output Data
- Each reducer writes its output to its own file, normally named part-nnnnn, where nnnnn is the partition ID of the reducer
- Data output is supported by an OutputFormat and a Writer
- Hadoop provides predefined OutputFormats:
  - TextOutputFormat: the default output format
  - SequenceFileOutputFormat (normally used for chaining multiple MapReduce jobs)
  - NullOutputFormat

    job.setOutputFormatClass(SequenceFileOutputFormat.class);
Writing Multiple Files
- MultipleOutputs allows you to write data to multiple files whose names are derived from the output keys and values
- Output file names are of the form name-X-nnnn:
  - name is set by the code
  - X = m for mapper output, X = r for reducer output
  - nnnn is an integer designating the part number

Using MultipleOutputs
- Create an instance of MultipleOutputs in the reducer or mapper where the output is being generated, normally in setup():

    protected void setup(Context context) throws IOException, InterruptedException {
        multipleOutputs = new MultipleOutputs<KEY, VALUE>(context);
    }

- Close the MultipleOutputs instance once finished with it, normally in cleanup():

    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
Using MultipleOutputs
- Write the output key/value pair to the instance of MultipleOutputs, where name identifies the base output path:

    multipleOutputs.write(key, value, name);

- name is interpreted relative to the output directory, so it is possible to create subdirectories by including file path separator characters in name
- Include logic to determine which output file to write to, normally dependent on the key and/or value, e.g.
  - monthly or weekly reports - files identified by time periods
  - store or branch reports - files identified by store or branch, or both

Using MultipleOutputs
- MultipleOutputs delegates to the given OutputFormat
- Separate named outputs can be set up in the driver using addNamedOutput(), each with its own OutputFormat and key/value types:

    MultipleOutputs.addNamedOutput(job, name, OUTPUTFORMAT, KEY, VALUE);

- A single record can be written to multiple output files
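The routing logic one would wrap around multipleOutputs.write() can be sketched in plain Java. This is a hypothetical illustration (the "month" field and "reports/" path are invented for the example, and no real files are written):

```java
import java.util.*;

// Plain-Java sketch of the routing logic around multipleOutputs.write():
// the base output name is derived from the record itself (here, a month
// field), and a path separator in the name creates a subdirectory.
class MultipleOutputsDemo {
    // Returns the base output name for a record, e.g. "reports/2014-01".
    static String outputName(String month) {
        return "reports/" + month;
    }

    // Routes (month, value) records into per-name buckets, standing in for
    // the per-file output that MultipleOutputs would produce.
    static Map<String, List<String>> route(List<String[]> records) {
        Map<String, List<String>> files = new HashMap<>();
        for (String[] r : records) {   // r[0] = month, r[1] = value
            files.computeIfAbsent(outputName(r[0]), k -> new ArrayList<>()).add(r[1]);
        }
        return files;
    }

    public static void main(String[] args) {
        List<String[]> recs = Arrays.asList(
                new String[]{"2014-01", "a"}, new String[]{"2014-02", "b"});
        System.out.println(route(recs).keySet());
    }
}
```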
Reading/Writing other types of data
- Reading from / writing to a database using JDBC:
  - Can use the DBInputFormat and DBOutputFormat
  - DBInputFormat doesn't have sharding capabilities, so you have to be careful not to overwhelm the database by reading with too many mappers
  - DBOutputFormat is very useful for outputting data to a database
- Reading XML:
  - Create a custom InputFormat to read a whole file at a time (see Chapter 7 in Hadoop: The Definitive Guide), suitable for small XML files, or
  - Use XMLInputFormat from Mahout (the machine learning library that is implemented on Hadoop)

Chaining MapReduce Jobs
Chaining MapReduce Jobs
- We've looked at single MapReduce jobs, but not every problem can be solved with a single MapReduce job
- MapReduce can get very complex with multiple MR jobs
- Many problems can be solved with MapReduce by writing several MapReduce steps which run in series, in parallel, or both
- These can be controlled and made to interact with each other (dependencies)
- This gives better control and allows for greater computational capabilities

Chaining Jobs in a Sequence
MapReduce 1 → MapReduce 2 → MapReduce 3 → ...
- Run MapReduce jobs sequentially, with the output of one job being the input of another
- Note: watch the intermediate file format - SequenceFileOutputFormat & SequenceFileInputFormat are useful for this
- Remember: Sequence files are Hadoop's compressed binary file format for storing key/value pairs
- Set up a Job (job1) in the Driver and run job1; then set up a new Job (job2) in the Driver, with its input path set to the output path of job1, and run job2, etc.
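The sequential pattern can be sketched in plain Java, with the output of "job1" feeding "job2" the way job2's input path would point at job1's output directory. The two stages here (a word count, then a frequency filter) are invented purely for illustration:

```java
import java.util.*;

// Plain-Java sketch of chaining: job1's output becomes job2's input,
// mirroring input-path = output-path wiring between two MapReduce jobs.
class ChainDemo {
    // "Job 1": a word count over the input lines.
    static Map<String, Integer> job1WordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines)
            for (String w : line.split(" "))
                counts.merge(w, 1, Integer::sum);
        return counts;
    }

    // "Job 2": keep only words that occurred more than once.
    static Map<String, Integer> job2FilterFrequent(Map<String, Integer> counts) {
        Map<String, Integer> out = new TreeMap<>();
        counts.forEach((k, v) -> { if (v > 1) out.put(k, v); });
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("he likes to drink", "he likes to wink");
        System.out.println(job2FilterFrequent(job1WordCount(input)));
    }
}
```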
Chaining Jobs with Dependencies
MapReduce 1, MapReduce 2 → MapReduce 3
- Dependencies occur when jobs don't simply run sequentially
- Use the Job, ControlledJob & JobControl classes to set up and manage job dependencies

Setting up workflows
- A JobControl is created to hold the workflow
  - allows for the creation of simple workflows
  - represents a graph of Jobs to run
  - dependencies are specified in code

    JobControl control = new JobControl("Workflow-Example");

- A ControlledJob is set up for each job in the workflow
  - ControlledJob is a wrapper for Job
  - the ControlledJob constructor can take in job dependencies

    ControlledJob step1 = new ControlledJob(job1, null);
    List<ControlledJob> dependencies = new ArrayList<ControlledJob>();
    dependencies.add(step1);
    ControlledJob step2 = new ControlledJob(job2, dependencies);

- Dependencies between jobs can also be set up using addDependingJob(), e.g. step2.addDependingJob(step1) means step2 will not start until step1 has finished
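The rule that JobControl enforces - a job may start only when every job it depends on has finished - can be sketched in plain Java (a hypothetical Step class, not the Hadoop ControlledJob API):

```java
import java.util.*;

// Plain-Java sketch of the dependency rule enforced by ControlledJob /
// JobControl: a step is ready to run only when all its dependencies are done.
class WorkflowDemo {
    static class Step {
        final String name;
        final List<Step> dependencies = new ArrayList<>();
        boolean finished = false;

        Step(String name) { this.name = name; }

        void addDependingOn(Step dep) { dependencies.add(dep); }

        boolean ready() {
            for (Step d : dependencies) if (!d.finished) return false;
            return true;
        }
    }

    public static void main(String[] args) {
        Step step1 = new Step("job1");
        Step step2 = new Step("job2");
        step2.addDependingOn(step1);
        System.out.println("job2 ready before job1 finishes? " + step2.ready());
    }
}
```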
Setting up workflows
- Each ControlledJob is added to the JobControl object using addJob():

    control.addJob(step1);
    control.addJob(step2);

- The JobControl is executed in a thread (JobControl implements Runnable):

    Thread workflowThread = new Thread(control, "Workflow-Thread");
    workflowThread.setDaemon(true);
    workflowThread.start();

Setting up workflows
- Wait for the JobControl to complete and report the results:

    while (!control.allFinished()) {
        Thread.sleep(500);
    }
    if (control.getFailedJobList().size() > 0) {
        log.error(control.getFailedJobList().size() + " jobs failed!");
        for (ControlledJob job : control.getFailedJobList()) {
            log.error(job.getJobName() + " failed");
        }
    } else {
        log.info("Success!! Workflow completed [" + control.getSuccessfulJobList().size() + "] jobs");
    }

- JobControl has methods to allow monitoring and tracking of its jobs
Chaining preprocessing & postprocessing steps
- Sequential jobs: [ map reduce ]+
- Modular jobs: map+ reduce map*
- Preprocessing (and postprocessing) might require a number of Mappers to run sequentially, e.g. text preprocessing
- Sequential jobs (using the identity Reducer) are inefficient
- Use ChainMapper and ChainReducer to implement modular pre- and postprocessing steps:
  - each mapper is added, with its own job configuration parameters, to the ChainMapper or ChainReducer
  - each mapper can be run individually, which is useful for testing/debugging

MapReduce Algorithms
Available in pdf at
Programmer Control
- The ability to construct complex data structures as keys and values to store and communicate partial results
- The ability to execute user-specified initialisation code at the beginning of a map or reduce task, and the ability to execute user-specified termination code at the end of a map or reduce task
- The ability to preserve state in both mappers and reducers across multiple input or intermediate keys
- The ability to control the sort order of intermediate keys, and therefore the order in which a reducer will encounter particular keys
- The ability to control the partitioning of the key space, and therefore the set of keys that will be encountered by a particular reducer
Useful code!
User-defined types - example

    public class TextPair implements WritableComparable<TextPair> {
        private Text first;
        private Text second;

        public TextPair() {
            set(new Text(), new Text());
        }
        public TextPair(String first, String second) {
            set(new Text(first), new Text(second));
        }
        public TextPair(Text first, Text second) {
            set(first, second);
        }
        public void set(Text first, Text second) {
            this.first = first;
            this.second = second;
        }
        public Text getFirst() { return first; }
        public Text getSecond() { return second; }

        @Override
        public void write(DataOutput out) throws IOException {
            first.write(out);
            second.write(out);
        }
        @Override
        public void readFields(DataInput in) throws IOException {
            first.readFields(in);
            second.readFields(in);
        }
        @Override
        public int compareTo(TextPair tp) {
            int cmp = first.compareTo(tp.first);
            if (cmp != 0) {
                return cmp;
            }
            return second.compareTo(tp.second);
        }
        @Override
        public int hashCode() {
            return first.hashCode() * 163 + second.hashCode();
        }
        @Override
        public boolean equals(Object o) {
            if (o instanceof TextPair) {
                TextPair tp = (TextPair) o;
                return first.equals(tp.first) && second.equals(tp.second);
            }
            return false;
        }
        @Override
        public String toString() {
            return first + "\t" + second;
        }
    }
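The ordering and hashing behaviour of TextPair can be exercised without Hadoop by substituting String for Text. The sketch below is an analog for illustration only, not the Writable type itself (so write()/readFields() are omitted):

```java
// Plain-Java analog of TextPair (String instead of Text): comparison is on
// the first field, then the second; the hash combines both fields with the
// same 163 multiplier used above.
class StringPair implements Comparable<StringPair> {
    final String first, second;

    StringPair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public int compareTo(StringPair tp) {
        int cmp = first.compareTo(tp.first);
        return cmp != 0 ? cmp : second.compareTo(tp.second);
    }
    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof StringPair)) return false;
        StringPair tp = (StringPair) o;
        return first.equals(tp.first) && second.equals(tp.second);
    }
    @Override
    public String toString() {
        return first + "\t" + second;
    }

    public static void main(String[] args) {
        System.out.println(new StringPair("apple", "zebra").compareTo(
                new StringPair("banana", "ant")) < 0);
    }
}
```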
WritableComparator acts as a factory for RawComparator instances (that Writable implementations have registered). For example, to obtain a comparator for IntWritable, we just use:

    RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);

    IntWritable w1 = new IntWritable(163);
    IntWritable w2 = new IntWritable(67);
    assertThat(comparator.compare(w1, w2), greaterThan(0));
More informationRecommended Literature
COSC 6339 Big Data Analytics Introduction to Map Reduce (I) Edgar Gabriel Fall 2018 Recommended Literature Original MapReduce paper by google http://research.google.com/archive/mapreduce-osdi04.pdf Fantastic
More informationMap Reduce and Design Patterns Lecture 4
Map Reduce and Design Patterns Lecture 4 Fang Yu Software Security Lab. Department of Management Information Systems College of Commerce, National Chengchi University http://soslab.nccu.edu.tw Cloud Computation,
More informationDatabase Applications (15-415)
Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April
More informationAnnouncements. Lab Friday, 1-2:30 and 3-4:30 in Boot your laptop and start Forte, if you brought your laptop
Announcements Lab Friday, 1-2:30 and 3-4:30 in 26-152 Boot your laptop and start Forte, if you brought your laptop Create an empty file called Lecture4 and create an empty main() method in a class: 1.00
More informationProcessing Distributed Data Using MapReduce, Part I
Processing Distributed Data Using MapReduce, Part I Computer Science E-66 Harvard University David G. Sullivan, Ph.D. MapReduce A framework for computation on large data sets that are fragmented and replicated
More informationTITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP
TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop
More informationVendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo
Vendor: Cloudera Exam Code: CCD-410 Exam Name: Cloudera Certified Developer for Apache Hadoop Version: Demo QUESTION 1 When is the earliest point at which the reduce method of a given Reducer can be called?
More informationTI2736-B Big Data Processing. Claudia Hauff
TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More informationAgenda CS121/IS223. Reminder. Object Declaration, Creation, Assignment. What is Going On? Variables in Java
CS121/IS223 Object Reference Variables Dr Olly Gotel ogotel@pace.edu http://csis.pace.edu/~ogotel Having problems? -- Come see me or call me in my office hours -- Use the CSIS programming tutors Agenda
More informationBig Data: Architectures and Data Analytics
Big Data: Architectures and Data Analytics January 22, 2018 Student ID First Name Last Name The exam is open book and lasts 2 hours. Part I Answer to the following questions. There is only one right answer
More informationCS435 Introduction to Big Data Spring 2018 Colorado State University. 2/12/2018 Week 5-A Sangmi Lee Pallickara
W5.A.0.0 CS435 Introduction to Big Data W5.A.1 FAQs PA1 has been posted Feb. 21, 5:00PM via Canvas Individual submission (No team submission) Source code of examples in lectures: https://github.com/adamjshook/mapreducepatterns
More informationHadoop & Big Data Analytics Complete Practical & Real-time Training
An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE
More informationProgramming with Hadoop MapReduce. Kostas Solomos Computer Science Department University of Crete, Greece
Programming with Hadoop MapReduce Kostas Solomos Computer Science Department University of Crete, Greece What we will cover Diving deeper into Hadoop architecture Using a shared input for the mappers/reducers
More informationMap Reduce. Yerevan.
Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate
More informationCS121/IS223. Object Reference Variables. Dr Olly Gotel
CS121/IS223 Object Reference Variables Dr Olly Gotel ogotel@pace.edu http://csis.pace.edu/~ogotel Having problems? -- Come see me or call me in my office hours -- Use the CSIS programming tutors CS121/IS223
More informationGiraph: Large-scale graph processing infrastructure on Hadoop. Qu Zhi
Giraph: Large-scale graph processing infrastructure on Hadoop Qu Zhi Why scalable graph processing? Web and social graphs are at immense scale and continuing to grow In 2008, Google estimated the number
More informationExpert Lecture plan proposal Hadoop& itsapplication
Expert Lecture plan proposal Hadoop& itsapplication STARTING UP WITH BIG Introduction to BIG Data Use cases of Big Data The Big data core components Knowing the requirements, knowledge on Analyst job profile
More informationVendor: Hortonworks. Exam Code: HDPCD. Exam Name: Hortonworks Data Platform Certified Developer. Version: Demo
Vendor: Hortonworks Exam Code: HDPCD Exam Name: Hortonworks Data Platform Certified Developer Version: Demo QUESTION 1 Workflows expressed in Oozie can contain: A. Sequences of MapReduce and Pig. These
More informationMapReduce-style data processing
MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic
More informationSnapshots and Repeatable reads for HBase Tables
Snapshots and Repeatable reads for HBase Tables Note: This document is work in progress. Contributors (alphabetical): Vandana Ayyalasomayajula, Francis Liu, Andreas Neumann, Thomas Weise Objective The
More informationData Analytics Job Guarantee Program
Data Analytics Job Guarantee Program 1. INSTALLATION OF VMWARE 2. MYSQL DATABASE 3. CORE JAVA 1.1 Types of Variable 1.2 Types of Datatype 1.3 Types of Modifiers 1.4 Types of constructors 1.5 Introduction
More informationCS 231 Data Structures and Algorithms, Fall 2016
CS 231 Data Structures and Algorithms, Fall 2016 Dr. Bruce A. Maxwell Department of Computer Science Colby College Course Description Focuses on the common structures used to store data and the standard
More informationData abstractions: ADTs Invariants, Abstraction function. Lecture 4: OOP, autumn 2003
Data abstractions: ADTs Invariants, Abstraction function Lecture 4: OOP, autumn 2003 Limits of procedural abstractions Isolate implementation from specification Dependency on the types of parameters representation
More informationCS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Return type for collect()? Can
More informationExamTorrent. Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you
ExamTorrent http://www.examtorrent.com Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you Exam : Apache-Hadoop-Developer Title : Hadoop 2.0 Certification exam for Pig
More informationIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -
More informationsqoop Automatic database import Aaron Kimball Cloudera Inc. June 18, 2009
sqoop Automatic database import Aaron Kimball Cloudera Inc. June 18, 2009 The problem Structured data already captured in databases should be used with unstructured data in Hadoop Tedious glue code necessary
More informationActual4Dumps. Provide you with the latest actual exam dumps, and help you succeed
Actual4Dumps http://www.actual4dumps.com Provide you with the latest actual exam dumps, and help you succeed Exam : HDPCD Title : Hortonworks Data Platform Certified Developer Vendor : Hortonworks Version
More informationSection 05: Solutions
Section 05: Solutions 1. Asymptotic Analysis (a) Applying definitions For each of the following, choose a c and n 0 which show f(n) O(g(n)). Explain why your values of c and n 0 work. (i) f(n) = 5000n
More informationLecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!
Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm
More informationMAT 3670: Lab 3 Bits, Data Types, and Operations
MAT 3670: Lab 3 Bits, Data Types, and Operations Background In previous labs, we have used Turing machines to manipulate bit strings. In this lab, we will continue to focus on bit strings, placing more
More informationHDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung
HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per
More informationHadoop Streaming. Table of contents. Content-Type text/html; utf-8
Content-Type text/html; utf-8 Table of contents 1 Hadoop Streaming...3 2 How Does Streaming Work... 3 3 Package Files With Job Submissions...4 4 Streaming Options and Usage...4 4.1 Mapper-Only Jobs...
More informationML from Large Datasets
10-605 ML from Large Datasets 1 Announcements HW1b is going out today You should now be on autolab have a an account on stoat a locally-administered Hadoop cluster shortly receive a coupon for Amazon Web
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationApril Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.
1. MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. MapReduce is a framework for processing big data which processes data in two phases, a Map
More informationComplex stories about Sqooping PostgreSQL data
Presentation slide for Sqoop User Meetup (Strata + Hadoop World NYC 2013) Complex stories about Sqooping PostgreSQL data 10/28/2013 NTT DATA Corporation Masatake Iwasaki Introduction 2 About Me Masatake
More informationAssignment 4: Hashtables
Assignment 4: Hashtables In this assignment we'll be revisiting the rhyming dictionary from assignment 2. But this time we'll be loading it into a hashtable and using the hashtable ADT to implement a bad
More information1/30/2019 Week 2- B Sangmi Lee Pallickara
Week 2-A-0 1/30/2019 Colorado State University, Spring 2019 Week 2-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING Term project deliverable
More information