3. Big Data Processing


1 3. Big Data Processing Cloud Computing & Big Data MASTER IN INFORMATICS ENGINEERING, FIB/UPC, Fall Jordi Torres, UPC - BSC

2 Slides are only a presentation guide. We will discuss and debate additional concepts/ideas that come up through your participation! (and we may skip part of the content) FEEL FREE TO PARTICIPATE!

3 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop 3

4 My motivation for including it now! Cloud Computing? Cloud Computing & Big Data. Big Data has become a hot topic in the field of Information and Communication Technology (ICT) in recent years, and it is impossible to separate it from Cloud Computing. In particular, the exponential and fast growth of very different types of data has quickly raised concerns about how to store, manage, process and analyse the data. For these reasons I considered that this topic had to be included in this course. I hope you enjoy it! :-) 4

5 Motivation 5

6 Motivation 6

7 Motivation Source: 7

8 Motivation Cloud Computing Big Data 8

9 Big Data? Do you need a definition? It is data that becomes large enough that it cannot be processed using conventional methods. Enough for you? :-) Petabytes of data created daily: social networks, mobile phones, sensors, science, ... Source: datacenterknowledge.com, Digital Universe to Add 1.8 Zettabytes in 2011 9

10 SOURCE: 10

11 SOURCE: 11

12 SOURCE: 12

13 Internet of Things 13

14 Future of Cloud: Fog Computing? 14

15 New requirements for Cloud: For example: Barcelona Smart City 15

16 Why is Big Data Important? 40% projected growth in global data generated per year vs. 5% projected growth in global IT spending. 60% potential increase in retailers' operating margins possible with big data. (*) Source: Big Data: The next frontier for innovation, competition and productivity, McKinsey Global Institute, July

17 Why is Big Data Important? Data is more important than ever, but the exponential growth of data has overwhelmed most companies' ability to manage (and monetize) it. BIG MARKET FOR NEW COMPANIES: Your company? 17

18 Big Data. Source: [Figure: data scale from kilo up to exa (up to 10,000 times larger) plotted against decision frequency, from occasional (years, months, weeks) through frequent down to real-time (seconds, ms, µs); traditional data warehouse and business intelligence address Data at Rest, while Data in Motion requires real-time decisions] 18

19 My definition :-) Big Data is data that exceeds the storing, processing and managing capacity of conventional systems. The reason is that the data is too big, moves too fast, or doesn't fit the structures of our current system architectures. Moreover, to gain value from this data, we must change the way we analyze it. 19

20 Big Data VOLUME Terabytes? Petabytes? Exabytes? Zettabytes? 20

21 1 Gigabyte (GB) = 10^9 bytes; 1 Terabyte (TB) = 1,000 Gigabytes (GB); 1 Petabyte (PB) = 1,000,000 Gigabytes (GB); 1 Exabyte (EB) = 10^9 Gigabytes (GB); 1 Zettabyte (ZB) = 10^12 Gigabytes (GB) 21

22 Big Data: VARIETY Source: Toni Brey Urbiotica.com 22

23 Big Data: VARIETY. Data growth is increasingly unstructured. Structured: data containing a defined data type, format, structure. E.g. a transactional database. Semi-structured: textual data files with a discernible pattern, enabling parsing. E.g. an XML data file + XML schema. Quasi-structured: textual data with erratic data formats. E.g. Web clickstream (may contain inconsistencies). Unstructured: no inherent structure and different types of files. E.g. PDFs, images, videos. 23

24 Big Data: VELOCITY Real-time required 24

25 Summary: Volume: large volumes of data (Terabytes, Petabytes, ...); data that cannot be stored in a conventional RDBMS. Variety: source data is diverse (web logs, application logs, machine-generated data, social network data, etc.); doesn't fall into neat relational structures; unstructured, semi-structured. Velocity: streaming data, Complex Event Processing data; velocity of incoming data and speed of responding to it 25

26 Big Data VOLUME + VARIETY + VELOCITY BIG DATA CHALLENGES? 26

27 Big Data Challenges Data storage Data processing Data management Data analysis and ... (not included in this course :-P) DEBATE!!! 27

28 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop 28

29 Current constraints of conventional IT. Execution Time: new requirements for real-time decisions vs. Conventional Systems. Interactive or real-time querying of large datasets is seen as a key to analyst productivity (real-time as in query times fast enough to keep the user in the flow of analysis, from sub-second to less than a few minutes). [Figure axes: Data Volume, GBs to PBs] 29

30 Today's IT technology. The existing large-scale data management schemes aren't fast enough and reduce analytical effectiveness when users can't explore the data by quickly iterating through various query schemes. MEMORY (= fast, expensive, volatile) vs. STORAGE (= slow, cheap, non-volatile). HDD is 100 times cheaper than RAM, but 1000 times slower 30

31 New proposals: in-memory. [Figure: execution time vs. data volume (GBs to PBs), comparing conventional systems with in-memory research proposals] 31

32 In-memory optimizations BI example: SOURCE: 32

33 In-memory optimizations results SOURCE: 33

34 Some of the current in-memory tools: We see companies with large data stores building out their own in-memory tools, e.g., Dremel at Google, Druid at Metamarkets, and Sting at Netflix, and new tools, like Cloudera's Impala announcement at the conference, UC Berkeley AMPLab's Spark, SAP HANA, and Platfora. Source: O'Reilly Strata 34

35 In-memory: new storage tech required. Data storage challenges. Present solutions: solid-state drive (SSD), non-volatile. Research: Storage Class Memory (SCM) 35 Source: ano_devices/memoryaddressing/

36 SCM candidates: should be a non-volatile, low-cost, high-performance, highly reliable solid-state memory. Currently available Flash technology falls short of these requirements; some new type of SCM technology needs to emerge (not my expertise! :-( ). Some candidates: Improved Flash; FeRAM (Ferroelectric RAM); MRAM (Magnetic RAM); Phase Change Memory; RRAM (Resistive RAM); Solid Electrolyte; Organic and Polymeric Memory 36

37 Debate. Old compute-centric model: data lives on disk and tape; move data to the CPU as needed; deep storage hierarchy. New data-centric model (manycore, FPGA, massive parallelism; persistent memory: Flash, Phase Change): data lives in persistent memory; many CPUs surround and use it; shallow/flat storage hierarchy. Source: Heiko Joerg

38 Debate Source: 38

39 Debate Source: 39

40 Debate Source: 40

41 Debate Source: 41

42 Debate Remote DMA Source: 42

43 Debate Source: 43

44 Debate Source: 44

45 Debate: Big Data storage. HDD is 100 times cheaper than RAM, but 1000 times slower. Source: 45

46 Debate: Big Data storage Source: 46

47 Market: Infiniband to the public cloud? 47

48 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop 48

49 Traditional Large-Scale Computation. Traditionally, the primary push has been to increase the computation power of a single machine: processor-bound workloads, normally with small amounts of data, and complex processing performed on that data 49

50 Traditional Large-Scale Computation. Moore's Law, roughly stated: processing power doubles every two years. It is not enough! Distributed systems evolved to allow developers to use multiple machines for a single job: MPI, OpenMP, OmpSs/COMPSs 50

51 Distributed Computing: Commodity hardware. Supercomputers are not affordable for everybody: commodity hardware. Programming for traditional distributed systems is complex (main challenges and bottlenecks): data exchange requires synchronization; bandwidth limitations; temporal dependencies; it is difficult to deal with partial failures of systems built from commodity hardware 51

52 Challenges: Failure Failure is the defining difference between distributed and local programming Design distributed systems with the expectation of failure Problem: Developers spend more time designing for failure than they do actually working on the problem itself A new approach for Failure is needed!!! 52

53 Challenges: Data (reminder). Typically, in traditional Large-Scale Computation, data is stored on a Storage Area Network. At compute time, data is copied to the compute nodes. Fine for relatively small amounts of data. However, as we discussed, modern systems have to deal with far more data than was the case in the past, so getting the data to the processors becomes the bottleneck. Ex: transferring 50 GB at 75 MB/s (a typical disk transfer rate) takes approx. 11 minutes. Ex: transferring 1 TB (~1,000 GB)???? A new approach to data is needed!!! 53
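
The arithmetic behind this bottleneck is worth spelling out. A quick back-of-the-envelope calculation (a sketch, assuming the 75 MB/s disk rate quoted above):

# Back-of-the-envelope transfer times at the 75 MB/s disk rate quoted above.
RATE_MB_S = 75.0

for label, size_gb in [("50 GB", 50), ("1 TB", 1000)]:
    seconds = size_gb * 1024 / RATE_MB_S  # GB -> MB, then divide by MB/s
    print("%s: %.0f s (~%.1f minutes)" % (label, seconds, seconds / 60))

# Output: 50 GB takes ~683 s (~11.4 minutes); 1 TB takes ~13653 s, almost 4 hours.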

54 New approach required. DATA: do not move the data at compute time; have the data already there. FAILURE: the system must support partial failure; if a component of the system fails, do not require a full restart of the entire system. CONSISTENCY: component failures during execution of a job should not affect the outcome of the job. SCALABILITY: increasing resources should support a proportional increase in load capacity 54

55 MapReduce. To meet these challenges: MapReduce, a programming model introduced by Google in the early 2000s to support distributed computing on large data sets on clusters of computers. Applications are written in a high-level programming model: developers do not worry about network programming, temporal dependencies, etc. This work takes a radically new approach to the problem of distributed computing: distribute the data as it is initially stored in the system; data is replicated multiple times across the system for increased availability and reliability 55

56 MapReduce Impact? Bringing commodity big data processing to a broad audience, in the same way the commodity LAMP stack changed the landscape of web applications to Web 2.0 56

57 MapReduce: Very high level overview. The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes. Solves the issue of data too large to fit onto a single machine. Distributed computing over many servers. Batch processing model. Two phases: in the Map phase, input data is processed, item by item, and transformed into an intermediate data set; in the Reduce phase, these intermediate results are reduced to a summarized data set, which is the desired end result. 57

58 MapReduce: Very high level overview. MapReduce processes data in a batch-oriented fashion and may (normally) take minutes or hours. Source: TDWI.org 58

59 MapReduce: Very high level overview Three distinct operations: Loading the data This operation is more properly called Extract, Transform, Load (ETL) in data warehousing terminology. Data must be extracted from its source, structured to make it ready for processing, and loaded into the storage layer for MapReduce to operate on it. MapReduce This phase will retrieve data from storage, process it (map, collect and sort map results, reduce) and return the results to the storage. Extracting the result Once processing is complete, for the result to be useful, it must be retrieved from the storage and presented. 59

60 MapReduce: Very high level overview. MapReduce Programming Model. Data type: key-value records. Map function: (K_in, V_in) -> list(K_inter, V_inter). Reduce function: (K_inter, list(V_inter)) -> list(K_out, V_out) 60
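
To make these signatures concrete, here is a minimal single-machine sketch of the data flow (a local simulation for illustration only, not the Hadoop API): map emits (key, value) pairs, the framework groups them by key, and reduce folds each group.

from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each (K_in, V_in) record yields a list of (K_inter, V_inter) pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle/sort phase: group the intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    output = []
    # Reduce phase: each (K_inter, list(V_inter)) yields (K_out, V_out) pairs.
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output

# Word count expressed in this model:
docs = [(1, "it was the best of times"), (2, "it was the worst of times")]
counts = run_mapreduce(docs,
                       lambda _, line: [(w, 1) for w in line.split()],
                       lambda word, values: [(word, sum(values))])
print(counts)  # [('best', 1), ('it', 2), ('of', 2), ('the', 2), ('times', 2), ('was', 2), ('worst', 1)]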

61 MapReduce. Map step: the master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Map(k1, v1) -> list(k2, v2). Reduce step: the master node then takes the answers to all the sub-problems and combines them to get the output, the answer to the problem it was originally trying to solve. Reduce(k2, list(v2)) -> list(v3). Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 61

62 MapReduce: Very high level overview. Programming Model: map / reduce functions. Suitable for embarrassingly parallel problems. Distributed Computing Framework: clusters of commodity hardware; large datasets; fault tolerant; splits jobs into a number of smaller tasks; moves code to data (local computation); allows programs to scale transparently with input size; abstracts away fault tolerance, synchronization, ... 62

63 MapReduce By providing a data-parallel programming model, MapReduce can control job execution in useful ways: Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures & stragglers User focuses on application, not on complexities of distributed computing 63

64 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop 64

65 Example: Word Count Assume you have a cluster of 50 computers, each with an attached local disk and half full of web pages. What is a simple parallel programming framework that would support the computation of word counts? 65

66 Example: Word Count Basic Pattern: Strings 1. Extract words from web pages in parallel. 2. Hash and sort words. 3. Count in parallel. 66

67 MapReduce example. Input is files with one document per record. User specifies the map function. Input of map: "it was the best of times". Output of map: (it, 1), (was, 1), (the, 1), (best, 1), ... 67

68 MapReduce example (cont.) The MapReduce library gathers together all pairs with the same key value (shuffle/sort phase). The user-defined reduce function combines all the values associated with the same key. Ex: Input of reduce: key = it, values = (1, 1); key = was, values = (1, 1); key = best, values = (1); key = worst, values = (1). Output of reduce: (it, 2), (was, 2), (best, 1), (worst, 1) 68

69 E.g. Common wordcount Hello World Hello MapReduce Fig1: Sample input Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 69

70 E.g. Common wordcount. Input: "Hello World" / "Hello MapReduce". MAP first intermediate output: (Hello, 1), (World, 1); second intermediate output: (Hello, 1), (MapReduce, 1). REDUCE final output: (Hello, 2), (MapReduce, 1), (World, 1). Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 70

71 E.g. Common wordcount
void map(string i, string line):
    for word in line:
        print word, 1
Fig 2: wordcount map function. Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez, March 2012

72 E.g. Common wordcount
void reduce(string word, list partial_counts):
    total = 0
    for c in partial_counts:
        total += c
    print word, total
Fig 3: wordcount reduce function. Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 72

73 Why is the word count example important? It is one of the most important examples for the type of text processing often done with MapReduce. There is an important mapping: document <-> data record, words <-> (field, value) 73

74 Other example applications. Search: Input: (linenumber, line) records. Output: lines matching a given pattern. Map: if(line matches pattern): output(line). Reduce: identity function. Alternative: no reducer (map-only job) 74

75 Other example applications. Inverted index: Input: (filename, text) records. Output: list of files containing each word. Map: foreach word in text.split(): output(word, filename). Reduce: def reduce(word, filenames): output(word, sort(filenames)) 75

76 Other example applications. Inverted index: hamlet.txt: "to be or not to be"; 12th.txt: "be not afraid of greatness". Map output: (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt), (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt). After (sort): afraid, (12th.txt); be, (12th.txt, hamlet.txt); greatness, (12th.txt); not, (12th.txt, hamlet.txt); of, (12th.txt); or, (hamlet.txt); to, (hamlet.txt) 76
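
The same inverted index as a runnable sketch, reusing the group-by-key idea from the local simulation shown earlier (again an illustration, not the Hadoop API):

from itertools import groupby
from operator import itemgetter

files = {"hamlet.txt": "to be or not to be",
         "12th.txt": "be not afraid of greatness"}

# Map: emit a (word, filename) pair for every word in every file.
pairs = [(word, name) for name, text in files.items() for word in text.split()]

# Shuffle/sort by word, then Reduce: a sorted, de-duplicated file list per word.
pairs.sort(key=itemgetter(0))
index = {word: sorted(set(f for _, f in group))
         for word, group in groupby(pairs, key=itemgetter(0))}

print(index["be"])  # ['12th.txt', 'hamlet.txt']
print(index["to"])  # ['hamlet.txt']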

77 When to use MapReduce? Does the problem I am trying to solve decompose into Map and Reduce operations? MapReduce works on any problem that is made up of exactly 2 functions at some level of abstraction: Map: execute the same operation on all data in the input set. Reduce: execute the same operation on each group of data produced by Map. There is a class of algorithms that cannot be efficiently implemented with the MapReduce programming model 77

78 Programming for distributed/parallel systems. Main challenges and bottlenecks: data exchange requires synchronization; bandwidth limitations; temporal dependencies; failures (of systems built from commodity hardware); dealing with multiple parallel computing resources and distributed data resources. Different programming models deal with different challenges 78

79 Programming for distributed/parallel systems. If the main challenge is to deal with partial failures: the MapReduce programming model. MapReduce allows easy programming with the expectation of failure. Example: assume you are searching a cluster of servers and one node is unable to respond at that moment. Since MapReduce could not reach that node, it will reschedule the corresponding Map or Reduce task and run it later on another node. Essentially, it tries to guarantee that all information is processed despite the unpredictability of software and hardware in these environments. 79

80 Programming for distributed/parallel systems. Main challenges and bottlenecks: data exchange requires synchronization; bandwidth limitations; temporal dependencies; failures (of systems built from commodity hardware); dealing with multiple parallel computing resources and distributed data resources. Different programming models deal with different challenges 80

81 Programming for distributed/parallel systems. If the main challenge is to deal with dependencies: use other programming models, e.g. the OmpSs/COMPSs programming model. Reduces the complexity of programming parallel applications on complex platforms (multicores/GPUs, distributed computing, Clouds). Based on traditional programming languages (C/C++, Java, Fortran) and sequential programming. Task-based, with an intelligent runtime: builds a task dependence graph based on directionality hints given by the programmer; performs scheduling and resource management, exploiting potential parallelism; automatic data transfers, exploiting data locality 81

82 OmpSs/COMPSs vs MapReduce. [Figure: task graphs side by side, input data flowing through compute nodes as an arbitrary task graph for OmpSs/COMPSs vs. a fixed mappers-then-reducers pipeline for MapReduce] 82

83 OmpSs/COMPSs vs MapReduce. Data: arbitrary structure vs. (key, value) pairs. Functions: arbitrary vs. Map & Reduce. Middleware: BSC middleware vs. Hadoop. Ease of use: low vs. medium. Scope: wide vs. narrow. Graph structure: dynamic directed acyclic graph vs. two-level inverted tree 83

84 Conclusions MapReduce programming model hides the complexity of work distribution and fault tolerance Principal design philosophies: Make it scalable, so you can throw hardware at problems Make it cheap, lowering hardware, programming and admin costs MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time Cloud computing makes it straightforward to start using Hadoop (or other parallel software) at scale 84

85 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop 85

86 Hadoop MapReduce. Hadoop is the dominant open source MapReduce implementation. Funded by Yahoo, it emerged in 2006. The Hadoop project is now hosted by Apache. Implemented in Java. (The data to be processed must be loaded into, e.g., the Hadoop Distributed Filesystem.) Source: Wikipedia 86

87 Hadoop MapReduce Hadoop is an open source MapReduce runtime provided by the Apache Software Foundation De-facto standard, free, open-source MapReduce implementation. Endorsed by: 87

88 Hadoop - Architecture Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 88

89 Hadoop: Very high-level overview. When data is loaded into the system, it is split into blocks of 64 MB/128 MB. A map task typically works on a single block. A master program allocates work to nodes (which work in parallel) such that a map task will work on a block of data stored locally on that node. If a node fails, the master will detect that failure and re-assign the work to a different node on the system 89
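
Since each map task handles one block, the block size directly sets the degree of parallelism. A quick sketch of the arithmetic (the 1 TB input is just an illustrative assumption):

# One map task per HDFS block: how many tasks does a 1 TB input generate?
TB = 1024 ** 4
for block_mb in (64, 128):
    blocks = TB // (block_mb * 1024 ** 2)
    print("%d MB blocks -> %d blocks, i.e. ~%d map tasks" % (block_mb, blocks, blocks))

# 64 MB blocks -> 16384 map tasks; 128 MB blocks -> 8192 map tasks.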

90 Hadoop essentials. Computation: move the computation to the data. Storage: keeping track of the data and metadata. Data is sharded across the cluster. Cluster management tools... 90

91 (default) Hadoop Stack. Applications (more detail in the next part!!!). Compute Services: Hadoop's MapReduce. Data Services: HBase (NoSQL database). Storage Services: Hadoop Distributed File System (HDFS). Resource Fabrics 91

92 Basic Cluster Components. One of each: NameNode (NN), JobTracker (JT). Set of each per slave machine: TaskTracker (TT), DataNode (DN) 92

93 Putting Everything Together. [Figure: cluster diagram with a namenode host running the namenode daemon, a job submission node running the jobtracker, and slave nodes each running a tasktracker and a datanode daemon on top of the Linux file system] 93

94 Anatomy of a Job MapReduce program in Hadoop = Hadoop job Jobs are divided into map and reduce tasks An instance of running a task is called a task attempt Multiple jobs can be composed into a workflow 94

95 Anatomy of a Job. Job submission process: the client (i.e., driver program) creates a job, configures it, and submits it to the JobTracker. JobClient computes input splits (on the client end). Job data (jar, configuration XML) are sent to the JobTracker. The JobTracker puts the job data in a shared location and enqueues tasks. TaskTrackers poll for tasks. Off to the races 95

96 Running a MapReduce job with Hadoop. Steps: defining the MapReduce stages in a Java program; loading the data into the Hadoop Distributed Filesystem; submitting the job for execution; retrieving the results from the filesystem. MapReduce has been implemented in a variety of other programming languages and systems; several NoSQL database systems have integrated MapReduce (later in this course) 96

97 Is Hadoop really that big a deal? Yes. According to a survey (*) from July 2011: 54% of organizations surveyed are using or considering Hadoop. Over 82% of users benefit from faster analyses and better utilization of computing resources. 94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data. Organizations use Hadoop in particular to work with unstructured data such as logs and event data (63%). (*) Research-Survey-Shows-Organizations-Hadoop-Perform 97

98 Hadoop and the enterprise? Hadoop is a complement to a relational data warehouse. Enterprises are generally not replacing their relational data warehouse with Hadoop. Hadoop's strengths: inexpensive; high reliability; extreme scalability; flexibility (data can be added without defining a schema). Hadoop's weaknesses: Hadoop is not an interactive query environment; processing data in Hadoop requires writing code 98

99 Using MapReduce in the Enterprise 99

100 Who is using Hadoop? eBay is using Hadoop for search optimization and research via a huge cluster. Facebook has two big Hadoop clusters for storing internal log and data sources and for data reporting, analytics and machine learning. Twitter uses Hadoop to store and process all its tweets and other data types generated on the social networking system. Yahoo has a Hadoop cluster of 4,500 nodes for research efforts around its ad systems and Web servers. It's also using it for scaling tests to drive Hadoop development on bigger clusters. Source: Enterprise/ ( ) 100

101 Who is using Hadoop? Source: Wikipedia, April

102 What is MapReduce model used for? At Google: Index construction for Google Search Article clustering for Google News Statistical machine translation At Yahoo!: Web map powering Yahoo! Search Spam detection for Yahoo! Mail At Facebook: Data mining Ad optimization Spam detection 102

103 Hadoop 1.0: The Apache Software Foundation delivers Hadoop 1.0, the much-anticipated 1.0 version of the popular open-source platform for storing and processing large amounts of data: six years of development, production experience, extensive testing, and feedback from hundreds of knowledgeable users, data scientists and systems engineers, culminating in a highly stable, enterprise-ready release of the fastest-growing big data platform. 103

104 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop 104

105 Amazon Elastic MapReduce. Provides a web-based interface and command-line tools for running Hadoop jobs on Amazon EC2. Data stored in Amazon S3. Monitors the job and shuts down machines after use. Small extra charge on top of EC2 pricing. If you want more control over how your Hadoop runs, you can launch a Hadoop cluster on EC2 manually using the scripts in src/contrib/ec2 105
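
Besides the web console, EMR can also be driven programmatically. Below is a sketch using the boto 2.x Python library from this period; the region, bucket paths and instance settings are placeholder assumptions, so check the boto documentation before relying on the exact signatures.

# Hypothetical sketch with boto 2.x; bucket names and paths are placeholders.
import boto.emr
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region("us-east-1")

# A streaming step wires up mapper/reducer scripts stored in S3.
step = StreamingStep(name="wordcount",
                     mapper="s3n://my-bucket/mapper.py",
                     reducer="s3n://my-bucket/reducer.py",
                     input="s3n://my-bucket/input",
                     output="s3n://my-bucket/output")

jobflow_id = conn.run_jobflow(name="wordcount job",
                              steps=[step],
                              num_instances=3,
                              master_instance_type="m1.small",
                              slave_instance_type="m1.small")
print(jobflow_id)  # the cluster (job flow) identifier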

106 Amazon Elastic MapReduce 106

107 Elastic MapReduce Workflow 107

108 Elastic MapReduce Workflow 108

109 Elastic MapReduce Workflow 109

110 Elastic MapReduce Workflow 110

111 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with Hadoop and Homework 3! 111

112 Getting Started with Hadoop 112

113 Getting Started with Hadoop Different ways to write jobs: Java API Hadoop Streaming (for Python, Perl, etc) Pipes API (C++) R 113

114 Hadoop API. Different APIs to write Hadoop programs: a rich Java API (the main way to write Hadoop programs); a Streaming API that can be used to write map and reduce functions in any programming language (using standard input and output); a C++ API (Hadoop Pipes); higher-level languages (e.g., Pig, Hive) 114

115 Hadoop API
Mapper:
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
void configure(JobConf job)
void close() throws IOException
Reducer/Combiner:
void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
void configure(JobConf job)
void close() throws IOException
Partitioner:
int getPartition(K2 key, V2 value, int numPartitions) 115

116 WordCount.java
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
} 116

117 WordCount.java
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
} 117

118 WordCount.java
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
} 118

119 WordCount.java
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
} 119

120 E.g. Common wordcount Hello World Hello MapReduce Fig1: Sample input Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 120

121 E.g. Common wordcount
void map(string i, string line):
    for word in line:
        print word, 1
Fig 2: wordcount map function. Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez, March 2012

122 E.g. Common wordcount
void reduce(string word, list partial_counts):
    total = 0
    for c in partial_counts:
        total += c
    print word, total
Fig 3: wordcount reduce function. Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 122

123 E.g. Common wordcount. Input: "Hello World" / "Hello MapReduce". MAP first intermediate output: (Hello, 1), (World, 1); second intermediate output: (Hello, 1), (MapReduce, 1). REDUCE final output: (Hello, 2), (MapReduce, 1), (World, 1). Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 123

124 Word Count Python Mapper
import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
Source: Robert Grossman Tutorial Supercomputing 2011

125 Word Count R Mapper
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))  # helper not shown on the slide
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep=""), sep="")
}
close(con)
Source: Robert Grossman Tutorial Supercomputing 2011

126 Word Count Java Mapper
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
Source: Robert Grossman Tutorial Supercomputing 2011

127 Code Comparison Word Count Mapper: the Python, Java and R mappers from the three previous slides, shown side by side. Source: Robert Grossman Tutorial Supercomputing 2011

128 Word Count Python Reducer
import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

if __name__ == "__main__":
    main()
Source: Robert Grossman Tutorial Supercomputing 2011

129 Word Count R Reducer
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
Source: Robert Grossman Tutorial Supercomputing 2011

130 Word Count R Reducer (cont'd)
    if (exists(word, envir = env, inherits = FALSE)) {
        oldCount <- get(word, envir = env)
        assign(word, oldCount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

131 Word Count Java Reducer
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

132 Code Comparison Word Count Reducer: the Python, Java and R reducers from the three previous slides, shown side by side. Source: Robert Grossman Tutorial Supercomputing 2011

133 HOMEWORK 3: Groups of 2 or 3 students. Vowels Count program (*). You can use: Amazon Elastic MapReduce, or your own local Hadoop installation. Presentation day: Monday 02/12/2013. One/two groups will be randomly chosen. Slides (or web page): hands-on style. (*) we could agree on another program 133

134 Some Resources. Hadoop, the Definitive Guide. Tom White. O'Reilly, 2012. Hadoop: video tutorials. Cloudera: video tutorials. MapR: Amazon Elastic MapReduce guide: GettingStartedGuide/ Slides from PARLab Parallel Boot Camp: Cloud Computing with MapReduce and Hadoop, by Matei Zaharia, Electrical Engineering and Computer Sciences, University of California, Berkeley ootcamp_clouds/

MapReduce programming model

MapReduce programming model MapReduce programming model technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate

More information

Outline. What is Big Data? Hadoop HDFS MapReduce Twitter Analytics and Hadoop

Outline. What is Big Data? Hadoop HDFS MapReduce Twitter Analytics and Hadoop Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School September 2012 This work is licensed

More information

Map-Reduce in Various Programming Languages

Map-Reduce in Various Programming Languages Map-Reduce in Various Programming Languages 1 Context of Map-Reduce Computing The use of LISP's map and reduce functions to solve computational problems probably dates from the 1960s -- very early in the

More information

Big Data landscape Lecture #2

Big Data landscape Lecture #2 Big Data landscape Lecture #2 Contents 1 1 CORE Technologies 2 3 MapReduce YARN 4 SparK 5 Cassandra Contents 2 16 HBase 72 83 Accumulo memcached 94 Blur 10 5 Sqoop/Flume Contents 3 111 MongoDB 12 2 13

More information

PARLab Parallel Boot Camp

PARLab Parallel Boot Camp PARLab Parallel Boot Camp Cloud Computing with MapReduce and Hadoop Matei Zaharia Electrical Engineering and Computer Sciences University of California, Berkeley What is Cloud Computing? Cloud refers to

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

Introduction to HDFS and MapReduce

Introduction to HDFS and MapReduce Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -

More information

Big Data Analytics. 4. Map Reduce I. Lars Schmidt-Thieme

Big Data Analytics. 4. Map Reduce I. Lars Schmidt-Thieme Big Data Analytics 4. Map Reduce I Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany original slides by Lucas Rego

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 2: Hadoop Nuts and Bolts Jimmy Lin University of Maryland Thursday, January 31, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

EE657 Spring 2012 HW#4 Zhou Zhao

EE657 Spring 2012 HW#4 Zhou Zhao EE657 Spring 2012 HW#4 Zhou Zhao Problem 6.3 Solution Referencing the sample application of SimpleDB in Amazon Java SDK, a simple domain which includes 5 items is prepared in the code. For instance, the

More information

MapReduce Simplified Data Processing on Large Clusters

MapReduce Simplified Data Processing on Large Clusters MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /

More information

Introduction to Map/Reduce. Kostas Solomos Computer Science Department University of Crete, Greece

Introduction to Map/Reduce. Kostas Solomos Computer Science Department University of Crete, Greece Introduction to Map/Reduce Kostas Solomos Computer Science Department University of Crete, Greece What we will cover What is MapReduce? How does it work? A simple word count example (the Hello World! of

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Introduction to Hadoop

Introduction to Hadoop Hadoop 1 Why Hadoop Drivers: 500M+ unique users per month Billions of interesting events per day Data analysis is key Need massive scalability PB s of storage, millions of files, 1000 s of nodes Need cost

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

CS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor

CS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor CS 470 Spring 2018 Mike Lam, Professor Parallel Algorithm Development (Foster's Methodology) Graphics and content taken from IPP section 2.7 and the following: http://www.mcs.anl.gov/~itf/dbpp/text/book.html

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus

UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus Getting to know MapReduce MapReduce Execution Pipeline Runtime Coordination and Task Management MapReduce Application Hadoop Word Count Implementation.

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018 Parallel Processing - MapReduce and FlumeJava Amir H. Payberah payberah@kth.se 14/09/2018 The Course Web Page https://id2221kth.github.io 1 / 83 Where Are We? 2 / 83 What do we do when there is too much

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Large-scale Information Processing

Large-scale Information Processing Sommer 2013 Large-scale Information Processing Ulf Brefeld Knowledge Mining & Assessment brefeld@kma.informatik.tu-darmstadt.de Anecdotal evidence... I think there is a world market for about five computers,

More information

Distributed Systems CS6421

Distributed Systems CS6421 Distributed Systems CS6421 Intro to Distributed Systems and the Cloud Prof. Tim Wood v I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (1/2) January 10, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

Data Deluge. Billions of users connected through the net. Storage getting cheaper Store more data!

Data Deluge. Billions of users connected through the net. Storage getting cheaper Store more data! Hadoop 1 Data Deluge Billions of users connected through the net WWW, FB, twitter, cell phones, 80% of the data on FB was produced last year Storage getting cheaper Store more data! Why Hadoop Drivers:

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Spark and Cassandra. Solving classical data analytic task by using modern distributed databases. Artem Aliev DataStax

Spark and Cassandra. Solving classical data analytic task by using modern distributed databases. Artem Aliev DataStax Spark and Cassandra Solving classical data analytic task by using modern distributed databases Artem Aliev DataStax Spark and Cassandra Solving classical data analytic task by using modern distributed

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Teaching Map-reduce Parallel Computing in CS1

Teaching Map-reduce Parallel Computing in CS1 Teaching Map-reduce Parallel Computing in CS1 Richard Brown, Patrick Garrity, Timothy Yates Mathematics, Statistics, and Computer Science St. Olaf College Northfield, MN rab@stolaf.edu Elizabeth Shoop

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

Introduction to Map/Reduce & Hadoop

Introduction to Map/Reduce & Hadoop Introduction to Map/Reduce & Hadoop Vassilis Christophides christop@csd.uoc.gr http://www.csd.uoc.gr/~hy562 University of Crete 1 Peta-Bytes Data Processing 2 1 1 What is MapReduce? MapReduce: programming

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox April 16 th, 2015 Emily Fox 2015 1 Document Retrieval n Goal: Retrieve

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014 Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Getting Started with Hadoop

Getting Started with Hadoop Getting Started with Hadoop May 28, 2018 Michael Völske, Shahbaz Syed Web Technology & Information Systems Bauhaus-Universität Weimar 1 webis 2018 What is Hadoop Started in 2004 by Yahoo Open-Source implementation

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

Map- reduce programming paradigm

Map- reduce programming paradigm Map- reduce programming paradigm Some slides are from lecture of Matei Zaharia, and distributed computing seminar by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. In pioneer days they

More information

Cloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L12: Cloud Computing 1

Cloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L12: Cloud Computing 1 Cloud Computing Leonidas Fegaras University of Texas at Arlington Web Data Management and XML L12: Cloud Computing 1 Computing as a Utility Cloud computing is a model for enabling convenient, on-demand

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Hierarchy of knowledge BIG DATA 9/7/2017. Architecture

Hierarchy of knowledge BIG DATA 9/7/2017. Architecture BIG DATA Architecture Hierarchy of knowledge Data: Element (fact, figure, etc.) which is basic information that can be to be based on decisions, reasoning, research and which is treated by the human or

More information

MapReduce-style data processing

MapReduce-style data processing MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Lab 11 Hadoop MapReduce (2)

Lab 11 Hadoop MapReduce (2) Lab 11 Hadoop MapReduce (2) 1 Giới thiệu Để thiết lập một Hadoop cluster, SV chọn ít nhất là 4 máy tính. Mỗi máy có vai trò như sau: - 1 máy làm NameNode: dùng để quản lý không gian tên (namespace) và

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File

More information

2013 AWS Worldwide Public Sector Summit Washington, D.C.

2013 AWS Worldwide Public Sector Summit Washington, D.C. 2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Data Analysis Using MapReduce in Hadoop Environment

Data Analysis Using MapReduce in Hadoop Environment Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti

More information

PARALLEL DATA PROCESSING IN BIG DATA SYSTEMS

PARALLEL DATA PROCESSING IN BIG DATA SYSTEMS PARALLEL DATA PROCESSING IN BIG DATA SYSTEMS Great Ideas in ICT - June 16, 2016 Irene Finocchi (finocchi@di.uniroma1.it) Title keywords How big? The scale of things Data deluge Every 2 days we create as

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data

More information

Cloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L3b: Cloud Computing 1

Cloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L3b: Cloud Computing 1 Cloud Computing Leonidas Fegaras University of Texas at Arlington Web Data Management and XML L3b: Cloud Computing 1 Computing as a Utility Cloud computing is a model for enabling convenient, on-demand

More information

Top 25 Big Data Interview Questions And Answers

Top 25 Big Data Interview Questions And Answers Top 25 Big Data Interview Questions And Answers By: Neeru Jain - Big Data The era of big data has just begun. With more companies inclined towards big data to run their operations, the demand for talent

More information

September 2013 Alberto Abelló & Oscar Romero 1

September 2013 Alberto Abelló & Oscar Romero 1 duce-i duce-i September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate several use cases of duce 2. Describe what the duce environment is 3. Explain 6 benefits of using duce 4.

More information

Cloud Computing. Up until now

Cloud Computing. Up until now Cloud Computing Lecture 9 Map Reduce 2010-2011 Introduction Up until now Definition of Cloud Computing Grid Computing Content Distribution Networks Cycle-Sharing Distributed Scheduling 1 Outline Map Reduce:

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Introduction to Map/Reduce & Hadoop

Introduction to Map/Reduce & Hadoop Introduction to Map/Reduce & Hadoop V. CHRISTOPHIDES University of Crete & INRIA Paris 1 Peta-Bytes Data Processing 2 1 1 What is MapReduce? MapReduce: programming model and associated implementation for

More information

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc. MapReduce and HDFS This presentation includes course content University of Washington Redistributed under the Creative Commons Attribution 3.0 license. All other contents: Overview Why MapReduce? What

More information

INTRODUCTION TO HADOOP

INTRODUCTION TO HADOOP Hadoop INTRODUCTION TO HADOOP Distributed Systems + Middleware: Hadoop 2 Data We live in a digital world that produces data at an impressive speed As of 2012, 2.7 ZB of data exist (1 ZB = 10 21 Bytes)

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera, How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Introduction to Hadoop. Scott Seighman Systems Engineer Sun Microsystems

Introduction to Hadoop. Scott Seighman Systems Engineer Sun Microsystems Introduction to Hadoop Scott Seighman Systems Engineer Sun Microsystems 1 Agenda Identify the Problem Hadoop Overview Target Workloads Hadoop Architecture Major Components > HDFS > Map/Reduce Demo Resources

More information

Data Platforms and Pattern Mining

Data Platforms and Pattern Mining Morteza Zihayat Data Platforms and Pattern Mining IBM Corporation About Myself IBM Software Group Big Data Scientist 4Platform Computing, IBM (2014 Now) PhD Candidate (2011 Now) 4Lassonde School of Engineering,

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

Internet Measurement and Data Analysis (13)

Internet Measurement and Data Analysis (13) Internet Measurement and Data Analysis (13) Kenjiro Cho 2016-07-11 review of previous class Class 12 Search and Ranking (7/4) Search systems PageRank exercise: PageRank algorithm 2 / 64 today s topics

More information

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2 Big Data Analysis using Hadoop Map-Reduce An Introduction Lecture 2 Last Week - Recap 1 In this class Examine the Map-Reduce Framework What work each of the MR stages does Mapper Shuffle and Sort Reducer

More information

MapReduce and Hadoop

MapReduce and Hadoop Università degli Studi di Roma Tor Vergata MapReduce and Hadoop Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference Big Data stack High-level Interfaces Data Processing

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

COMP4442. Service and Cloud Computing. Lab 12: MapReduce. Prof. George Baciu PQ838.

COMP4442. Service and Cloud Computing. Lab 12: MapReduce. Prof. George Baciu PQ838. COMP4442 Service and Cloud Computing Lab 12: MapReduce www.comp.polyu.edu.hk/~csgeorge/comp4442 Prof. George Baciu csgeorge@comp.polyu.edu.hk PQ838 1 Contents Introduction to MapReduce A WordCount example

More information

Large-Scale GPU programming

Large-Scale GPU programming Large-Scale GPU programming Tim Kaldewey Research Staff Member Database Technologies IBM Almaden Research Center tkaldew@us.ibm.com Assistant Adjunct Professor Computer and Information Science Dept. University

More information

Introduction to Map Reduce

Introduction to Map Reduce Introduction to Map Reduce 1 Map Reduce: Motivation We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Lecture 12 DATA ANALYTICS ON WEB SCALE

Lecture 12 DATA ANALYTICS ON WEB SCALE Lecture 12 DATA ANALYTICS ON WEB SCALE Source: The Economist, February 25, 2010 The Data Deluge EIGHTEEN months ago, Li & Fung, a firm that manages supply chains for retailers, saw 100 gigabytes of information

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information