3. Big Data Processing


1 3. Big Data Processing Cloud Computing & Big Data MASTER IN INFORMATICS ENGINEERING, FIB/UPC, Fall Jordi Torres, UPC - BSC

2 Slides are only a presentation guide. We will discuss and debate additional concepts/ideas that come up through your participation! (and we may skip part of the content) FEEL FREE TO PARTICIPATE!

3 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop 3

4 My motivation for including it now! Cloud Computing? Cloud Computing & Big Data. Big Data has become a hot topic in the field of Information and Communication Technology (ICT) in recent years, and it is impossible to separate it from Cloud Computing. In particular, the exponential and fast growth of very different types of data has quickly raised concerns about how to store, manage, process and analyse the data. For these reasons I considered that this topic had to be included in this course. I hope you enjoy it! :-) 4

5 Motivation 5

6 Motivation 6

7 Motivation Source: 7

8 Motivation Cloud Computing Big Data 8

9 Big Data? Do you need a definition? It is data that becomes large enough that it cannot be processed using conventional methods. Enough for you? :-) Petabytes of data created daily: social networks, mobile phones, sensors, science, ... Source: datacenterknowledge.com, Digital Universe to Add 1.8 Zettabytes in 2011 9

10 SOURCE: 10

11 SOURCE: 11

12 SOURCE: 12

13 Internet of Things 13

14 Future of Cloud: Fog Computing? 14

15 New requirements for Cloud: For example: Barcelona Smart City 15

16 Why is Big Data Important? 40% projected growth in global data generated per year vs. 5% projected growth in global IT spending. 60% potential increase in retailers' operating margins possible with big data. (*) Source: Big Data: The next frontier for innovation, competition and productivity, McKinsey Global Institute, July

17 Why is Big Data Important? Data is more important than ever, but the exponential growth of data has overwhelmed most companies' ability to manage (and monetize) it. BIG MARKET FOR NEW COMPANIES: Your company? 17

18 Big Data. Source: [Figure: data scale from kilo up to exa (up to 10,000 times larger) plotted against decision frequency, from occasional (years, months, weeks) through frequent down to real-time (seconds, ms, µs); traditional data warehouse and business intelligence address Data at Rest, while Data in Motion requires real-time decisions] 18

19 My definition :-) Big Data is data that exceeds the storing, processing and managing capacity of conventional systems. The reason is that the data is too big, moves too fast, or doesn't fit the structures of our current system architectures. Moreover, to gain value from this data, we must change the way we analyze it. 19

20 Big Data VOLUME Terabytes? Petabytes? Exabytes? Zettabytes? 20

21 1 Gigabyte (GB) = 10^9 bytes; 1 Terabyte (TB) = 1,000 Gigabytes (GB); 1 Petabyte (PB) = 1,000,000 Gigabytes (GB); 1 Exabyte (EB) = 10^9 Gigabytes (GB); 1 Zettabyte (ZB) = 10^12 Gigabytes (GB) 21

22 Big Data: VARIETY Source: Toni Brey Urbiotica.com 22

23 Big Data: VARIETY. Data growth is increasingly unstructured. Structured: data containing a defined data type, format, structure. E.g. a transactional database. Semi-structured: textual data files with a discernible pattern, enabling parsing. E.g. an XML data file + XML schema. Quasi-structured: textual data with erratic data formats. E.g. Web clickstream (may contain inconsistencies). Unstructured: no inherent structure and different types of files. E.g. PDFs, images, videos. 23

24 Big Data: VELOCITY Real-time required 24

25 Summary: Volume: large volumes of data (Terabytes, Petabytes, ...); data that cannot be stored in a conventional RDBMS. Variety: source data is diverse (web logs, application logs, machine-generated data, social network data, etc.); doesn't fall into neat relational structures; unstructured, semi-structured. Velocity: streaming data, Complex Event Processing data; velocity of incoming data and speed of responding to it 25

26 Big Data VOLUME + VARIETY + VELOCITY BIG DATA CHALLENGES? 26

27 Big Data Challenges Data storage Data processing Data management Data analysis and ... (not included in this course :-P) DEBATE!!! 27

28 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop 28

29 Current constraints of conventional IT. Execution Time: new requirements for real-time decisions vs. Conventional Systems. Interactive or real-time querying of large datasets is seen as a key to analyst productivity (real-time as in query times fast enough to keep the user in the flow of analysis, from sub-second to less than a few minutes). [Figure axes: Data Volume, GBs to PBs] 29

30 Today's IT technology. The existing large-scale data management schemes aren't fast enough and reduce analytical effectiveness when users can't explore the data by quickly iterating through various query schemes. MEMORY (= fast, expensive, volatile) vs. STORAGE (= slow, cheap, non-volatile). HDD is 100 times cheaper than RAM, but 1000 times slower 30

31 New proposals: in-memory. [Figure: execution time vs. data volume (GBs to PBs), comparing conventional systems with in-memory research proposals] 31

32 In-memory optimizations BI example: SOURCE: 32

33 In-memory optimizations results SOURCE: 33

34 Some of the current in-memory tools: We see companies with large data stores building out their own in-memory tools, e.g., Dremel at Google, Druid at Metamarkets, and Sting at Netflix, and new tools, like Cloudera's Impala announcement at the conference, UC Berkeley AMPLab's Spark, SAP HANA, and Platfora. Source: O'Reilly Strata 34

35 In-memory: new storage tech required. Data storage challenges. Present solutions: solid-state drive (SSD), non-volatile. Research: Storage Class Memory (SCM) 35 Source: ano_devices/memoryaddressing/

36 SCM candidates: should be a non-volatile, low-cost, high-performance, highly reliable solid-state memory. Currently available Flash technology falls short of these requirements; some new type of SCM technology needs to emerge (not my expertise! :-( ). Some candidates: Improved Flash; FeRAM (Ferroelectric RAM); MRAM (Magnetic RAM); Phase Change Memory; RRAM (Resistive RAM); Solid Electrolyte; Organic and Polymeric Memory 36

37 Debate. Old compute-centric model: data lives on disk and tape; move data to the CPU as needed; deep storage hierarchy. New data-centric model (manycore, FPGA, massive parallelism; persistent memory: Flash, Phase Change): data lives in persistent memory; many CPUs surround and use it; shallow/flat storage hierarchy. Source: Heiko Joerg

38 Debate Source: 38

39 Debate Source: 39

40 Debate Source: 40

41 Debate Source: 41

42 Debate Remote DMA Source: 42

43 Debate Source: 43

44 Debate Source: 44

45 Debate: Big Data storage. HDD is 100 times cheaper than RAM, but 1000 times slower. Source: 45

46 Debate: Big Data storage Source: 46

47 Market: Infiniband to the public cloud? 47

48 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop 48

49 Traditional Large-Scale Computation. Traditionally, the primary push has been to increase the computation power of a single machine: processor-bound workloads, normally with small amounts of data, and complex processing performed on that data 49

50 Traditional Large-Scale Computation. Moore's Law, roughly stated: processing power doubles every two years. It is not enough! Distributed systems evolved to allow developers to use multiple machines for a single job: MPI, OpenMP, OmpSs/COMPSs 50

51 Distributed Computing: Commodity hardware. Supercomputers are not affordable for everybody: commodity hardware. Programming for traditional distributed systems is complex (main challenges and bottlenecks): data exchange requires synchronization; bandwidth limitations; temporal dependencies; it is difficult to deal with partial failures of systems built from commodity hardware 51

52 Challenges: Failure Failure is the defining difference between distributed and local programming Design distributed systems with the expectation of failure Problem: Developers spend more time designing for failure than they do actually working on the problem itself A new approach for Failure is needed!!! 52

53 Challenges: Data (reminder). Typically, in traditional Large-Scale Computation, data is stored on a Storage Area Network. At compute time, data is copied to the compute nodes. Fine for relatively small amounts of data. However, as we discussed, modern systems have to deal with far more data than was the case in the past, so getting the data to the processors becomes the bottleneck. Ex: transferring 50 GB at 75 MB/s (a typical disk transfer rate) takes approx. 11 minutes. Ex: transferring 1 TB (~1,000 GB)???? A new approach to data is needed!!! 53
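
The arithmetic behind this bottleneck is worth spelling out. A quick back-of-the-envelope calculation (a sketch, assuming the 75 MB/s disk rate quoted above):

# Back-of-the-envelope transfer times at the 75 MB/s disk rate quoted above.
RATE_MB_S = 75.0

for label, size_gb in [("50 GB", 50), ("1 TB", 1000)]:
    seconds = size_gb * 1024 / RATE_MB_S  # GB -> MB, then divide by MB/s
    print("%s: %.0f s (~%.1f minutes)" % (label, seconds, seconds / 60))

# Output: 50 GB takes ~683 s (~11.4 minutes); 1 TB takes ~13653 s, almost 4 hours.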

54 New approach required. DATA: do not move the data at compute time; have the data already there. FAILURE: the system must support partial failure; if a component of the system fails, do not require a full restart of the entire system. CONSISTENCY: component failures during execution of a job should not affect the outcome of the job. SCALABILITY: increasing resources should support a proportional increase in load capacity 54

55 MapReduce. To meet these challenges: MapReduce, a programming model introduced by Google in the early 2000s to support distributed computing on large data sets on clusters of computers. Applications are written in a high-level programming model: developers do not worry about network programming, temporal dependencies, etc. This work takes a radically new approach to the problem of distributed computing: distribute the data as it is initially stored in the system; data is replicated multiple times across the system for increased availability and reliability 55

56 MapReduce Impact? Bringing commodity big data processing to a broad audience, in the same way the commodity LAMP stack changed the landscape of web applications to Web 2.0 56

57 MapReduce: Very high level overview. The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes. Solves the issue of data too large to fit onto a single machine. Distributed computing over many servers. Batch processing model. Two phases: in the Map phase, input data is processed, item by item, and transformed into an intermediate data set; in the Reduce phase, these intermediate results are reduced to a summarized data set, which is the desired end result. 57

58 MapReduce: Very high level overview. MapReduce processes data in a batch-oriented fashion and may (normally) take minutes or hours. Source: TDWI.org 58

59 MapReduce: Very high level overview Three distinct operations: Loading the data This operation is more properly called Extract, Transform, Load (ETL) in data warehousing terminology. Data must be extracted from its source, structured to make it ready for processing, and loaded into the storage layer for MapReduce to operate on it. MapReduce This phase will retrieve data from storage, process it (map, collect and sort map results, reduce) and return the results to the storage. Extracting the result Once processing is complete, for the result to be useful, it must be retrieved from the storage and presented. 59

60 MapReduce: Very high level overview. MapReduce Programming Model. Data type: key-value records. Map function: (K_in, V_in) -> list(K_inter, V_inter). Reduce function: (K_inter, list(V_inter)) -> list(K_out, V_out) 60
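
To make these signatures concrete, here is a minimal single-machine sketch of the data flow (a local simulation for illustration only, not the Hadoop API): map emits (key, value) pairs, the framework groups them by key, and reduce folds each group.

from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each (K_in, V_in) record yields a list of (K_inter, V_inter) pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle/sort phase: group the intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    output = []
    # Reduce phase: each (K_inter, list(V_inter)) yields (K_out, V_out) pairs.
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output

# Word count expressed in this model:
docs = [(1, "it was the best of times"), (2, "it was the worst of times")]
counts = run_mapreduce(docs,
                       lambda _, line: [(w, 1) for w in line.split()],
                       lambda word, values: [(word, sum(values))])
print(counts)  # [('best', 1), ('it', 2), ('of', 2), ('the', 2), ('times', 2), ('was', 2), ('worst', 1)]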

61 MapReduce. Map step: the master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Map(k1, v1) -> list(k2, v2). Reduce step: the master node then takes the answers to all the sub-problems and combines them to get the output, the answer to the problem it was originally trying to solve. Reduce(k2, list(v2)) -> list(v3). Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 61

62 MapReduce: Very high level overview. Programming Model: map / reduce functions. Suitable for embarrassingly parallel problems. Distributed Computing Framework: clusters of commodity hardware; large datasets; fault tolerant; splits jobs into a number of smaller tasks; moves code to data (local computation); allows programs to scale transparently with input size; abstracts away fault tolerance, synchronization, ... 62

63 MapReduce By providing a data-parallel programming model, MapReduce can control job execution in useful ways: Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures & stragglers User focuses on application, not on complexities of distributed computing 63

64 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop 64

65 Example: Word Count Assume you have a cluster of 50 computers, each with an attached local disk and half full of web pages. What is a simple parallel programming framework that would support the computation of word counts? 65

66 Example: Word Count Basic Pattern: Strings 1. Extract words from web pages in parallel. 2. Hash and sort words. 3. Count in parallel. 66

67 MapReduce example. Input is files with one document per record. User specifies the map function. Input of map: "it was the best of times". Output of map: (it, 1), (was, 1), (the, 1), (best, 1), ... 67

68 MapReduce example (cont.) The MapReduce library gathers together all pairs with the same key value (shuffle/sort phase). The user-defined reduce function combines all the values associated with the same key. Ex: Input of reduce: key = it, values = (1, 1); key = was, values = (1, 1); key = best, values = (1); key = worst, values = (1). Output of reduce: (it, 2), (was, 2), (best, 1), (worst, 1) 68

69 E.g. Common wordcount Hello World Hello MapReduce Fig1: Sample input Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 69

70 E.g. Common wordcount. Input: "Hello World" / "Hello MapReduce". MAP first intermediate output: (Hello, 1), (World, 1); second intermediate output: (Hello, 1), (MapReduce, 1). REDUCE final output: (Hello, 2), (MapReduce, 1), (World, 1). Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 70

71 E.g. Common wordcount
void map(string i, string line):
    for word in line:
        print word, 1
Fig 2: wordcount map function. Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez, March 2012

72 E.g. Common wordcount
void reduce(string word, list partial_counts):
    total = 0
    for c in partial_counts:
        total += c
    print word, total
Fig 3: wordcount reduce function. Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 72

73 Why is the word count example important? It is one of the most important examples for the type of text processing often done with MapReduce. There is an important mapping: document <-> data record, words <-> (field, value) 73

74 Other example applications. Search: Input: (linenumber, line) records. Output: lines matching a given pattern. Map: if(line matches pattern): output(line). Reduce: identity function. Alternative: no reducer (map-only job) 74

75 Other example applications. Inverted index: Input: (filename, text) records. Output: list of files containing each word. Map: foreach word in text.split(): output(word, filename). Reduce: def reduce(word, filenames): output(word, sort(filenames)) 75

76 Other example applications. Inverted index: hamlet.txt: "to be or not to be"; 12th.txt: "be not afraid of greatness". Map output: (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt), (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt). After (sort): afraid, (12th.txt); be, (12th.txt, hamlet.txt); greatness, (12th.txt); not, (12th.txt, hamlet.txt); of, (12th.txt); or, (hamlet.txt); to, (hamlet.txt) 76
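
The same inverted index as a runnable sketch, reusing the group-by-key idea from the local simulation shown earlier (again an illustration, not the Hadoop API):

from itertools import groupby
from operator import itemgetter

files = {"hamlet.txt": "to be or not to be",
         "12th.txt": "be not afraid of greatness"}

# Map: emit a (word, filename) pair for every word in every file.
pairs = [(word, name) for name, text in files.items() for word in text.split()]

# Shuffle/sort by word, then Reduce: a sorted, de-duplicated file list per word.
pairs.sort(key=itemgetter(0))
index = {word: sorted(set(f for _, f in group))
         for word, group in groupby(pairs, key=itemgetter(0))}

print(index["be"])  # ['12th.txt', 'hamlet.txt']
print(index["to"])  # ['hamlet.txt']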

77 When to use MapReduce? Does the problem I am trying to solve decompose into Map and Reduce operations? MapReduce works on any problem that is made up of exactly 2 functions at some level of abstraction: Map: execute the same operation on all data in the input set. Reduce: execute the same operation on each group of data produced by Map. There is a class of algorithms that cannot be efficiently implemented with the MapReduce programming model 77

78 Programming for distributed/parallel systems. Main challenges and bottlenecks: data exchange requires synchronization; bandwidth limitations; temporal dependencies; failures (of systems built from commodity hardware); dealing with multiple parallel computing resources and distributed data resources. Different programming models deal with different challenges 78

79 Programming for distributed/parallel systems. If the main challenge is to deal with partial failures: the MapReduce programming model. MapReduce allows easy programming with the expectation of failure. Example: assume you are searching a cluster of servers and one node is unable to respond at that moment. Since MapReduce could not reach that node, it will reschedule the corresponding Map or Reduce task and run it later on another node. Essentially, it tries to guarantee that all information is processed despite the unpredictability of software and hardware in these environments. 79

80 Programming for distributed/parallel systems. Main challenges and bottlenecks: data exchange requires synchronization; bandwidth limitations; temporal dependencies; failures (of systems built from commodity hardware); dealing with multiple parallel computing resources and distributed data resources. Different programming models deal with different challenges 80

81 Programming for distributed/parallel systems. If the main challenge is to deal with dependencies: use other programming models, e.g. the OmpSs/COMPSs programming model. Reduces the complexity of programming parallel applications on complex platforms (multicores/GPUs, distributed computing, Clouds). Based on traditional programming languages (C/C++, Java, Fortran) and sequential programming. Task-based, with an intelligent runtime: builds a task dependence graph based on directionality hints given by the programmer; performs scheduling and resource management, exploiting potential parallelism; automatic data transfers, exploiting data locality 81

82 OmpSs/COMPSs vs MapReduce. [Figure: task graphs side by side, input data flowing through compute nodes as an arbitrary task graph for OmpSs/COMPSs vs. a fixed mappers-then-reducers pipeline for MapReduce] 82

83 OmpSs/COMPSs vs MapReduce. Data: arbitrary structure vs. (key, value) pairs. Functions: arbitrary vs. Map & Reduce. Middleware: BSC middleware vs. Hadoop. Ease of use: low vs. medium. Scope: wide vs. narrow. Graph structure: dynamic directed acyclic graph vs. two-level inverted tree 83

84 Conclusions MapReduce programming model hides the complexity of work distribution and fault tolerance Principal design philosophies: Make it scalable, so you can throw hardware at problems Make it cheap, lowering hardware, programming and admin costs MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time Cloud computing makes it straightforward to start using Hadoop (or other parallel software) at scale 84

85 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop 85

86 Hadoop MapReduce. Hadoop is the dominant open source MapReduce implementation. Funded by Yahoo, it emerged in 2006. The Hadoop project is now hosted by Apache. Implemented in Java. (The data to be processed must be loaded into, e.g., the Hadoop Distributed Filesystem.) Source: Wikipedia 86

87 Hadoop MapReduce Hadoop is an open source MapReduce runtime provided by the Apache Software Foundation De-facto standard, free, open-source MapReduce implementation. Endorsed by: 87

88 Hadoop - Architecture Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 88

89 Hadoop: Very high-level overview. When data is loaded into the system, it is split into blocks of 64 MB/128 MB. A map task typically works on a single block. A master program allocates work to nodes (which work in parallel) such that a map task will work on a block of data stored locally on that node. If a node fails, the master will detect that failure and re-assign the work to a different node on the system 89
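
Since each map task handles one block, the block size directly sets the degree of parallelism. A quick sketch of the arithmetic (the 1 TB input is just an illustrative assumption):

# One map task per HDFS block: how many tasks does a 1 TB input generate?
TB = 1024 ** 4
for block_mb in (64, 128):
    blocks = TB // (block_mb * 1024 ** 2)
    print("%d MB blocks -> %d blocks, i.e. ~%d map tasks" % (block_mb, blocks, blocks))

# 64 MB blocks -> 16384 map tasks; 128 MB blocks -> 8192 map tasks.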

90 Hadoop essentials. Computation: move the computation to the data. Storage: keeping track of the data and metadata. Data is sharded across the cluster. Cluster management tools... 90

91 (default) Hadoop Stack. Applications (more detail in the next part!!!). Compute Services: Hadoop's MapReduce. Data Services: HBase (NoSQL database). Storage Services: Hadoop Distributed File System (HDFS). Resource Fabrics 91

92 Basic Cluster Components. One of each: NameNode (NN), JobTracker (JT). Set of each per slave machine: TaskTracker (TT), DataNode (DN) 92

93 Putting Everything Together. [Figure: cluster diagram with a namenode host running the namenode daemon, a job submission node running the jobtracker, and slave nodes each running a tasktracker and a datanode daemon on top of the Linux file system] 93

94 Anatomy of a Job MapReduce program in Hadoop = Hadoop job Jobs are divided into map and reduce tasks An instance of running a task is called a task attempt Multiple jobs can be composed into a workflow 94

95 Anatomy of a Job. Job submission process: the client (i.e., driver program) creates a job, configures it, and submits it to the JobTracker. JobClient computes input splits (on the client end). Job data (jar, configuration XML) are sent to the JobTracker. The JobTracker puts the job data in a shared location and enqueues tasks. TaskTrackers poll for tasks. Off to the races 95

96 Running a MapReduce job with Hadoop. Steps: defining the MapReduce stages in a Java program; loading the data into the Hadoop Distributed Filesystem; submitting the job for execution; retrieving the results from the filesystem. MapReduce has been implemented in a variety of other programming languages and systems; several NoSQL database systems have integrated MapReduce (later in this course) 96

97 Is Hadoop really that big a deal? Yes. According to a survey (*) from July 2011: 54% of organizations surveyed are using or considering Hadoop. Over 82% of users benefit from faster analyses and better utilization of computing resources. 94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data. Organizations use Hadoop in particular to work with unstructured data such as logs and event data (63%). (*) Research-Survey-Shows-Organizations-Hadoop-Perform 97

98 Hadoop and the enterprise? Hadoop is a complement to a relational data warehouse. Enterprises are generally not replacing their relational data warehouse with Hadoop. Hadoop's strengths: inexpensive; high reliability; extreme scalability; flexibility (data can be added without defining a schema). Hadoop's weaknesses: Hadoop is not an interactive query environment; processing data in Hadoop requires writing code 98

99 Using MapReduce in the Enterprise 99

100 Who is using Hadoop? eBay is using Hadoop for search optimization and research via a huge cluster. Facebook has two big Hadoop clusters for storing internal log and data sources and for data reporting, analytics and machine learning. Twitter uses Hadoop to store and process all its tweets and other data types generated on the social networking system. Yahoo has a Hadoop cluster of 4,500 nodes for research efforts around its ad systems and Web servers. It's also using it for scaling tests to drive Hadoop development on bigger clusters. Source: Enterprise/ ( ) 100

101 Who is using Hadoop? Source: Wikipedia, April

102 What is MapReduce model used for? At Google: Index construction for Google Search Article clustering for Google News Statistical machine translation At Yahoo!: Web map powering Yahoo! Search Spam detection for Yahoo! Mail At Facebook: Data mining Ad optimization Spam detection 102

103 Hadoop 1.0: The Apache Software Foundation delivers Hadoop 1.0, the much-anticipated 1.0 version of the popular open-source platform for storing and processing large amounts of data: six years of development, production experience, extensive testing, and feedback from hundreds of knowledgeable users, data scientists and systems engineers, culminating in a highly stable, enterprise-ready release of the fastest-growing big data platform. 103

104 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop 104

105 Amazon Elastic MapReduce. Provides a web-based interface and command-line tools for running Hadoop jobs on Amazon EC2. Data stored in Amazon S3. Monitors the job and shuts down machines after use. Small extra charge on top of EC2 pricing. If you want more control over how your Hadoop runs, you can launch a Hadoop cluster on EC2 manually using the scripts in src/contrib/ec2 105
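
Besides the web console, EMR can also be driven programmatically. Below is a sketch using the boto 2.x Python library from this period; the region, bucket paths and instance settings are placeholder assumptions, so check the boto documentation before relying on the exact signatures.

# Hypothetical sketch with boto 2.x; bucket names and paths are placeholders.
import boto.emr
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region("us-east-1")

# A streaming step wires up mapper/reducer scripts stored in S3.
step = StreamingStep(name="wordcount",
                     mapper="s3n://my-bucket/mapper.py",
                     reducer="s3n://my-bucket/reducer.py",
                     input="s3n://my-bucket/input",
                     output="s3n://my-bucket/output")

jobflow_id = conn.run_jobflow(name="wordcount job",
                              steps=[step],
                              num_instances=3,
                              master_instance_type="m1.small",
                              slave_instance_type="m1.small")
print(jobflow_id)  # the cluster (job flow) identifier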

106 Amazon Elastic MapReduce 106

107 Elastic MapReduce Workflow 107

108 Elastic MapReduce Workflow 108

109 Elastic MapReduce Workflow 109

110 Elastic MapReduce Workflow 110

111 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop and Homework 3! 111

112 Getting Started with Hadoop 112

113 Getting Started with Hadoop Different ways to write jobs: Java API Hadoop Streaming (for Python, Perl, etc) Pipes API (C++) R 113

114 Hadoop API. Different APIs to write Hadoop programs: a rich Java API (the main way to write Hadoop programs); a Streaming API that can be used to write map and reduce functions in any programming language (using standard input and output); a C++ API (Hadoop Pipes); higher-level languages (e.g., Pig, Hive) 114

115 Hadoop API
Mapper:
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
void configure(JobConf job)
void close() throws IOException
Reducer/Combiner:
void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
void configure(JobConf job)
void close() throws IOException
Partitioner:
int getPartition(K2 key, V2 value, int numPartitions) 115

116 WordCount.java
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
} 116

117 WordCount.java
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
} 117

118 WordCount.java
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
} 118

119 WordCount.java
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
} 119

120 E.g. Common wordcount Hello World Hello MapReduce Fig1: Sample input Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 120

121 E.g. Common wordcount
void map(string i, string line):
    for word in line:
        print word, 1
Fig 2: wordcount map function. Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez, March 2012

122 E.g. Common wordcount
void reduce(string word, list partial_counts):
    total = 0
    for c in partial_counts:
        total += c
    print word, total
Fig 3: wordcount reduce function. Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 122

123 E.g. Common wordcount. Input: "Hello World" / "Hello MapReduce". MAP first intermediate output: (Hello, 1), (World, 1); second intermediate output: (Hello, 1), (MapReduce, 1). REDUCE final output: (Hello, 2), (MapReduce, 1), (World, 1). Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 123

124 Word Count Python Mapper
import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
Source: Robert Grossman Tutorial Supercomputing 2011

125 Word Count R Mapper
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))  # helper not shown on the slide
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep=""), sep="")
}
close(con)
Source: Robert Grossman Tutorial Supercomputing 2011

126 Word Count Java Mapper
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
Source: Robert Grossman Tutorial Supercomputing 2011

127 Code Comparison Word Count Mapper: the Python, Java and R mappers from the three previous slides, shown side by side. Source: Robert Grossman Tutorial Supercomputing 2011

128 Word Count Python Reducer
import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

if __name__ == "__main__":
    main()
Source: Robert Grossman Tutorial Supercomputing 2011

129 Word Count R Reducer
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
Source: Robert Grossman Tutorial Supercomputing 2011

130 Word Count R Reducer (cont'd)
    if (exists(word, envir = env, inherits = FALSE)) {
        oldCount <- get(word, envir = env)
        assign(word, oldCount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

131 Word Count Java Reducer
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

132 Code Comparison Word Count Reducer: the Python, Java and R reducers from the three previous slides, shown side by side. Source: Robert Grossman Tutorial Supercomputing 2011

133 HOMEWORK 3: Groups of 2 or 3 students. Vowels Count program (*). You can use: Amazon Elastic MapReduce, or your own local Hadoop installation. Presentation day: Monday 02/12/2013. One/two groups will be randomly chosen. Slides (or web page): hands-on style. (*) we could agree on another program 133

134 Some Resources. Hadoop, the Definitive Guide. Tom White. O'Reilly, 2012. Hadoop: video tutorials. Cloudera: video tutorials. MapR: Amazon Elastic MapReduce guide: GettingStartedGuide/ Slides from PARLab Parallel Boot Camp: Cloud Computing with MapReduce and Hadoop, by Matei Zaharia, Electrical Engineering and Computer Sciences, University of California, Berkeley ootcamp_clouds/

MapReduce programming model

MapReduce programming model MapReduce programming model technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate

More information

Outline. What is Big Data? Hadoop HDFS MapReduce Twitter Analytics and Hadoop

Outline. What is Big Data? Hadoop HDFS MapReduce Twitter Analytics and Hadoop Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School September 2012 This work is licensed

More information

Map-Reduce in Various Programming Languages

Map-Reduce in Various Programming Languages Map-Reduce in Various Programming Languages 1 Context of Map-Reduce Computing The use of LISP's map and reduce functions to solve computational problems probably dates from the 1960s -- very early in the

More information

Big Data landscape Lecture #2

Big Data landscape Lecture #2 Big Data landscape Lecture #2 Contents 1 1 CORE Technologies 2 3 MapReduce YARN 4 SparK 5 Cassandra Contents 2 16 HBase 72 83 Accumulo memcached 94 Blur 10 5 Sqoop/Flume Contents 3 111 MongoDB 12 2 13

More information

PARLab Parallel Boot Camp

PARLab Parallel Boot Camp PARLab Parallel Boot Camp Cloud Computing with MapReduce and Hadoop Matei Zaharia Electrical Engineering and Computer Sciences University of California, Berkeley What is Cloud Computing? Cloud refers to

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

Introduction to HDFS and MapReduce

Introduction to HDFS and MapReduce Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -

More information

Big Data Analytics. 4. Map Reduce I. Lars Schmidt-Thieme

Big Data Analytics. 4. Map Reduce I. Lars Schmidt-Thieme Big Data Analytics 4. Map Reduce I Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany original slides by Lucas Rego

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 2: Hadoop Nuts and Bolts Jimmy Lin University of Maryland Thursday, January 31, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

EE657 Spring 2012 HW#4 Zhou Zhao

EE657 Spring 2012 HW#4 Zhou Zhao EE657 Spring 2012 HW#4 Zhou Zhao Problem 6.3 Solution Referencing the sample application of SimpleDB in Amazon Java SDK, a simple domain which includes 5 items is prepared in the code. For instance, the

More information

MapReduce Simplified Data Processing on Large Clusters

MapReduce Simplified Data Processing on Large Clusters MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /

More information

Introduction to Map/Reduce. Kostas Solomos Computer Science Department University of Crete, Greece

Introduction to Map/Reduce. Kostas Solomos Computer Science Department University of Crete, Greece Introduction to Map/Reduce Kostas Solomos Computer Science Department University of Crete, Greece What we will cover What is MapReduce? How does it work? A simple word count example (the Hello World! of

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Introduction to Hadoop

Introduction to Hadoop Hadoop 1 Why Hadoop Drivers: 500M+ unique users per month Billions of interesting events per day Data analysis is key Need massive scalability PB s of storage, millions of files, 1000 s of nodes Need cost

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

CS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor

CS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor CS 470 Spring 2018 Mike Lam, Professor Parallel Algorithm Development (Foster's Methodology) Graphics and content taken from IPP section 2.7 and the following: http://www.mcs.anl.gov/~itf/dbpp/text/book.html

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus

UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus Getting to know MapReduce MapReduce Execution Pipeline Runtime Coordination and Task Management MapReduce Application Hadoop Word Count Implementation.

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018 Parallel Processing - MapReduce and FlumeJava Amir H. Payberah payberah@kth.se 14/09/2018 The Course Web Page https://id2221kth.github.io 1 / 83 Where Are We? 2 / 83 What do we do when there is too much

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Large-scale Information Processing

Large-scale Information Processing Sommer 2013 Large-scale Information Processing Ulf Brefeld Knowledge Mining & Assessment brefeld@kma.informatik.tu-darmstadt.de Anecdotal evidence... I think there is a world market for about five computers,

More information

Distributed Systems CS6421

Distributed Systems CS6421 Distributed Systems CS6421 Intro to Distributed Systems and the Cloud Prof. Tim Wood v I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (1/2) January 10, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

Data Deluge. Billions of users connected through the net. Storage getting cheaper Store more data!

Data Deluge. Billions of users connected through the net. Storage getting cheaper Store more data! Hadoop 1 Data Deluge Billions of users connected through the net WWW, FB, twitter, cell phones, 80% of the data on FB was produced last year Storage getting cheaper Store more data! Why Hadoop Drivers:

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Spark and Cassandra. Solving classical data analytic task by using modern distributed databases. Artem Aliev DataStax

Spark and Cassandra. Solving classical data analytic task by using modern distributed databases. Artem Aliev DataStax Spark and Cassandra Solving classical data analytic task by using modern distributed databases Artem Aliev DataStax Spark and Cassandra Solving classical data analytic task by using modern distributed

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Teaching Map-reduce Parallel Computing in CS1

Teaching Map-reduce Parallel Computing in CS1 Teaching Map-reduce Parallel Computing in CS1 Richard Brown, Patrick Garrity, Timothy Yates Mathematics, Statistics, and Computer Science St. Olaf College Northfield, MN rab@stolaf.edu Elizabeth Shoop

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

Introduction to Map/Reduce & Hadoop

Introduction to Map/Reduce & Hadoop Introduction to Map/Reduce & Hadoop Vassilis Christophides christop@csd.uoc.gr http://www.csd.uoc.gr/~hy562 University of Crete 1 Peta-Bytes Data Processing 2 1 1 What is MapReduce? MapReduce: programming

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox April 16 th, 2015 Emily Fox 2015 1 Document Retrieval n Goal: Retrieve

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014 Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Getting Started with Hadoop

Getting Started with Hadoop Getting Started with Hadoop May 28, 2018 Michael Völske, Shahbaz Syed Web Technology & Information Systems Bauhaus-Universität Weimar 1 webis 2018 What is Hadoop Started in 2004 by Yahoo Open-Source implementation

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

Map- reduce programming paradigm

Map- reduce programming paradigm Map- reduce programming paradigm Some slides are from lecture of Matei Zaharia, and distributed computing seminar by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. In pioneer days they

More information

Cloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L12: Cloud Computing 1

Cloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L12: Cloud Computing 1 Cloud Computing Leonidas Fegaras University of Texas at Arlington Web Data Management and XML L12: Cloud Computing 1 Computing as a Utility Cloud computing is a model for enabling convenient, on-demand

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Hierarchy of knowledge BIG DATA 9/7/2017. Architecture

Hierarchy of knowledge BIG DATA 9/7/2017. Architecture BIG DATA Architecture Hierarchy of knowledge Data: Element (fact, figure, etc.) which is basic information that can be to be based on decisions, reasoning, research and which is treated by the human or

More information

MapReduce-style data processing

MapReduce-style data processing MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Lab 11 Hadoop MapReduce (2)

Lab 11 Hadoop MapReduce (2) Lab 11 Hadoop MapReduce (2) 1 Giới thiệu Để thiết lập một Hadoop cluster, SV chọn ít nhất là 4 máy tính. Mỗi máy có vai trò như sau: - 1 máy làm NameNode: dùng để quản lý không gian tên (namespace) và

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File

More information

2013 AWS Worldwide Public Sector Summit Washington, D.C.

2013 AWS Worldwide Public Sector Summit Washington, D.C. 2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Data Analysis Using MapReduce in Hadoop Environment

Data Analysis Using MapReduce in Hadoop Environment Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti

More information

PARALLEL DATA PROCESSING IN BIG DATA SYSTEMS

PARALLEL DATA PROCESSING IN BIG DATA SYSTEMS PARALLEL DATA PROCESSING IN BIG DATA SYSTEMS Great Ideas in ICT - June 16, 2016 Irene Finocchi (finocchi@di.uniroma1.it) Title keywords How big? The scale of things Data deluge Every 2 days we create as

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data

More information

Cloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L3b: Cloud Computing 1

Cloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L3b: Cloud Computing 1 Cloud Computing Leonidas Fegaras University of Texas at Arlington Web Data Management and XML L3b: Cloud Computing 1 Computing as a Utility Cloud computing is a model for enabling convenient, on-demand

More information

Top 25 Big Data Interview Questions And Answers

Top 25 Big Data Interview Questions And Answers Top 25 Big Data Interview Questions And Answers By: Neeru Jain - Big Data The era of big data has just begun. With more companies inclined towards big data to run their operations, the demand for talent

More information

September 2013 Alberto Abelló & Oscar Romero 1

September 2013 Alberto Abelló & Oscar Romero 1 duce-i duce-i September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate several use cases of duce 2. Describe what the duce environment is 3. Explain 6 benefits of using duce 4.

More information

Cloud Computing. Up until now

Cloud Computing. Up until now Cloud Computing Lecture 9 Map Reduce 2010-2011 Introduction Up until now Definition of Cloud Computing Grid Computing Content Distribution Networks Cycle-Sharing Distributed Scheduling 1 Outline Map Reduce:

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Introduction to Map/Reduce & Hadoop

Introduction to Map/Reduce & Hadoop Introduction to Map/Reduce & Hadoop V. CHRISTOPHIDES University of Crete & INRIA Paris 1 Peta-Bytes Data Processing 2 1 1 What is MapReduce? MapReduce: programming model and associated implementation for

More information

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc. MapReduce and HDFS This presentation includes course content University of Washington Redistributed under the Creative Commons Attribution 3.0 license. All other contents: Overview Why MapReduce? What

More information

INTRODUCTION TO HADOOP

INTRODUCTION TO HADOOP Hadoop INTRODUCTION TO HADOOP Distributed Systems + Middleware: Hadoop 2 Data We live in a digital world that produces data at an impressive speed As of 2012, 2.7 ZB of data exist (1 ZB = 10 21 Bytes)

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera, How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Introduction to Hadoop. Scott Seighman Systems Engineer Sun Microsystems

Introduction to Hadoop. Scott Seighman Systems Engineer Sun Microsystems Introduction to Hadoop Scott Seighman Systems Engineer Sun Microsystems 1 Agenda Identify the Problem Hadoop Overview Target Workloads Hadoop Architecture Major Components > HDFS > Map/Reduce Demo Resources

More information

Data Platforms and Pattern Mining

Data Platforms and Pattern Mining Morteza Zihayat Data Platforms and Pattern Mining IBM Corporation About Myself IBM Software Group Big Data Scientist 4Platform Computing, IBM (2014 Now) PhD Candidate (2011 Now) 4Lassonde School of Engineering,

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

Internet Measurement and Data Analysis (13)

Internet Measurement and Data Analysis (13) Internet Measurement and Data Analysis (13) Kenjiro Cho 2016-07-11 review of previous class Class 12 Search and Ranking (7/4) Search systems PageRank exercise: PageRank algorithm 2 / 64 today s topics

More information

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2 Big Data Analysis using Hadoop Map-Reduce An Introduction Lecture 2 Last Week - Recap 1 In this class Examine the Map-Reduce Framework What work each of the MR stages does Mapper Shuffle and Sort Reducer

More information

MapReduce and Hadoop

MapReduce and Hadoop Università degli Studi di Roma Tor Vergata MapReduce and Hadoop Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference Big Data stack High-level Interfaces Data Processing

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

COMP4442. Service and Cloud Computing. Lab 12: MapReduce. Prof. George Baciu PQ838.

COMP4442. Service and Cloud Computing. Lab 12: MapReduce. Prof. George Baciu PQ838. COMP4442 Service and Cloud Computing Lab 12: MapReduce www.comp.polyu.edu.hk/~csgeorge/comp4442 Prof. George Baciu csgeorge@comp.polyu.edu.hk PQ838 1 Contents Introduction to MapReduce A WordCount example

More information

Large-Scale GPU programming

Large-Scale GPU programming Large-Scale GPU programming Tim Kaldewey Research Staff Member Database Technologies IBM Almaden Research Center tkaldew@us.ibm.com Assistant Adjunct Professor Computer and Information Science Dept. University

More information

Introduction to Map Reduce

Introduction to Map Reduce Introduction to Map Reduce 1 Map Reduce: Motivation We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Lecture 12 DATA ANALYTICS ON WEB SCALE

Lecture 12 DATA ANALYTICS ON WEB SCALE Lecture 12 DATA ANALYTICS ON WEB SCALE Source: The Economist, February 25, 2010 The Data Deluge EIGHTEEN months ago, Li & Fung, a firm that manages supply chains for retailers, saw 100 gigabytes of information

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information