3. Big Data Processing
1 3. Big Data Processing Cloud Computing & Big Data MASTER ENGINYERIA INFORMÀTICA FIB/UPC Fall Jordi Torres, UPC - BSC
2 Slides are only for presentation guide We will discuss+debate additional concepts/ideas appeared during your participation! (and we could skip part of the content) FEEL FREE TO PARTICIPATE!
3 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with hadoop 3
4 My motivation for including it now! Cloud Computing? Cloud Computing & Big Data Big Data has become a hot topic in the field of Information and Communication Technology (ICT) in recent years, and it is impossible to separate it from Cloud Computing. In particular, the fast, exponential growth of very different types of data has quickly raised concerns about how to store, manage, process and analyse the data. For these reasons I considered that this topic had to be included in this course. I hope you enjoy it! :-) 4
5 Motivation 5
6 Motivation 6
7 Motivation Source: 7
8 Motivation Cloud Computing Big Data 8
9 Big Data? Do you need a definition? Big Data is data that becomes large enough that it cannot be processed using conventional methods. Is that enough for you? :-) Petabytes of data created daily social networks, mobile phones, sensors, science, Source: 9
10 SOURCE: 10
11 SOURCE: 11
12 SOURCE: 12
13 Internet of Things 13
14 Future of Cloud: Fog Computing? 14
15 New requirements for Cloud: For example: Barcelona Smart City 15
16 Why is Big Data Important 40% projected growth in global data generated per year vs 5% projected growth in global IT spending 60% potential increase in retailers' operating margins possible with big data (*) Source: Big Data: The next frontier for innovation, competition and productivity, McKinsey Global Institute, July
17 Why is Big Data Important Data is more important than ever, but the exponential growth of data has overwhelmed most companies' ability to manage (and monetize) it. BIG MARKET FOR NEW COMPANIES: Your company? 17
18 Big Data [Chart: data scale (Kilo up through Mega, Giga, Tera, Peta, Exa; up to 10,000 times larger) vs. decision frequency (occasional to frequent to real-time; yr, mo, wk, day, hr, min, sec, ms, µs). Traditional Data Warehouse and Business Intelligence handle data at rest; Big Data adds data in motion] Source: 18
19 My definition :-) Big Data is data that exceeds the storing, processing and managing capacity of conventional systems. The reason is that the data is too big, moves too fast, or doesn't fit the structures of our current system architectures. Moreover, to gain value from this data, we must change the way we analyze it. 19
20 Big Data VOLUME Terabytes? Petabytes? Exabytes? Zettabytes? 20
21 1 Gigabyte (GB) = 1,000,000,000 bytes 1 Terabyte (TB) = 1,000 Gigabytes (GB) 1 Petabyte (PB) = 1,000,000 Gigabytes (GB) 1 Exabyte (EB) = 1,000,000,000 Gigabytes (GB) 1 Zettabyte (ZB) = 1,000,000,000,000 (GB) 21
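The decimal scale above can be sanity-checked in a few lines of Python (an illustrative sketch; the helper name is made up):

```python
# Illustrative helper for the decimal byte-unit scale above (name is made up).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value in a decimal unit to bytes (1 KB = 10^3 B)."""
    return value * 10 ** (3 * UNITS.index(unit))

print(to_bytes(1, "TB") // to_bytes(1, "GB"))  # 1,000 GB per TB
print(to_bytes(1, "ZB") // to_bytes(1, "GB"))  # 10^12 GB per ZB
```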
22 Big Data: VARIETY Source: Toni Brey Urbiotica.com 22
23 Big Data: VARIETY Data Growth is Increasingly Unstructured Structured: data with a defined data type, format, structure E.g. transactional database Semi-structured: textual data files with a discernible pattern, enabling parsing E.g. XML data file + XML schema Quasi-structured: textual data with erratic data formats E.g. Web clickstream (may contain inconsistencies) Unstructured: no inherent structure and different types of files E.g. PDFs, images, videos. 23
24 Big Data: VELOCITY Real-time required 24
25 Summary: Volume: Large volumes of data Terabytes, Petabytes, Data that cannot be stored in conventional RDBMS Variety: Source data is diverse Web logs, application logs, machine-generated data, social network data, etc. Doesn't fall into neat relational structures Unstructured, Semi-structured Velocity: Streaming data, Complex Event Processing data Velocity of incoming data and speed of responding to it 25
26 Big Data VOLUME + VARIETY + VELOCITY BIG DATA CHALLENGES? 26
27 Big Data Challenges Data storage Data processing Data management Data analysis and (not include in this course :-P) DEBATE!!! 27
28 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with hadoop 28
29 Current constraints of conventional IT Execution Time new requirements for real-time decisions Conventional Systems Interactive or real-time querying of large datasets is seen as key to analyst productivity (real-time meaning query times fast enough to keep the user in the flow of analysis, from sub-second to less than a few minutes). GBs Data Volume PBs 29
30 Today's IT technology The existing large-scale data management schemes aren't fast enough and reduce analytical effectiveness when users can't explore the data by quickly iterating through various query schemes. MEMORY (= fast, expensive, volatile) STORAGE (= slow, cheap, non-volatile) HDD: ~100× cheaper than RAM, but ~1,000× slower 30
31 New proposals: in-memory Execution time research In-memory GBs PBs 31
32 In-memory optimizations BI example: SOURCE: 32
33 In-memory optimizations results SOURCE: 33
34 Some of the current in-memory tools: We see companies with large data stores building out their own in-memory tools, e.g., Dremel at Google, Druid at Metamarkets, and Sting at Netflix, and new tools, like Cloudera's Impala announcement at the conference, UC Berkeley AMPLab's Spark, SAP Hana, and Platfora Source: 34
35 In-memory: new storage tech required Data storage challenges: Present solutions: Solid-state drive (SSD), non-volatile Research: Storage Class Memory (SCM) Source: 35
36 SCM candidates: should be a non-volatile, low-cost, high-performance, highly reliable solid-state memory Currently available Flash technology falls short of these requirements Some new type of SCM technology needs to emerge (not my expertise! :-( Some Candidates Improved Flash FeRAM (Ferroelectric RAM) MRAM (Magnetic RAM) Phase Change Memory RRAM (Resistive RAM) Solid Electrolyte Organic and Polymeric Memory 36
37 Debate Old Compute-centric Model: data lives on disk and tape; move data to CPU as needed; deep storage hierarchy. New Data-centric Model: massive parallelism (manycore, FPGA); persistent memory (Flash, Phase Change); data lives in persistent memory; many CPUs surround and use it; shallow/flat storage hierarchy. Source: Heiko Joerg
38 Debate Source: 38
39 Debate Source: 39
40 Debate Source: 40
41 Debate Source: 41
42 Debate Remote DMA Source: 42
43 Debate Source: 43
44 Debate Source: 44
45 Debate: Big Data storage HDD: ~100× cheaper than RAM, but ~1,000× slower Source: 45
46 Debate: Big Data storage Source: 46
47 Market: Infiniband to the public cloud? 47
48 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with hadoop 48
49 Traditional Large-Scale Computation Traditionally: The primary push is to increase the computation power of a single machine Processor-bound workloads Normally with small amount of data Complex processing performed on that data 49
50 Traditional Large-Scale Computation Moore's Law: roughly stated, processing power doubles every two years but it is not enough! Distributed Systems evolved to allow developers to use multiple machines for a single job MPI OpenMP OmpSs/COMPSs 50
51 Distributed Computing: Commodity hardware Supercomputers are not affordable for everybody Commodity hardware Programming for traditional distributed systems is complex (main challenges and bottlenecks) Data exchange requires synchronization Bandwidth limitations Temporal dependencies It is difficult to deal with partial failures of a system built from commodity hardware 51
52 Challenges: Failure Failure is the defining difference between distributed and local programming Design distributed systems with the expectation of failure Problem: Developers spend more time designing for failure than they do actually working on the problem itself A new approach for Failure is needed!!! 52
53 Challenges: Data (reminder) Typically in a traditional Large-Scale Computation Data is stored on a Storage Area Network At compute time, data is copied to the compute nodes Fine for relatively small amounts of data However, as we discussed, modern systems have to deal with far more data than was the case in the past Getting the data to the processors becomes the bottleneck Ex: Transferring 50 GB at 75 MB/sec (a typical disk transfer rate) takes approx. 11 minutes Ex: Transferring 1 TB (1,000 GB)???? A new approach for Data is needed!!! 53
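The slide's arithmetic is easy to check directly (a back-of-the-envelope sketch; the function name and default rate are illustrative):

```python
def transfer_minutes(size_gb, rate_mb_per_s=75):
    """Minutes needed to move size_gb gigabytes at rate_mb_per_s megabytes/s."""
    return size_gb * 1000 / rate_mb_per_s / 60

print(round(transfer_minutes(50)))    # 50 GB -> about 11 minutes
print(round(transfer_minutes(1000)))  # 1 TB  -> about 222 minutes (~3.7 hours)
```

At these rates, moving a terabyte to the compute nodes takes hours, which is exactly why MapReduce moves the computation to the data instead.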
54 New approach required DATA: Not move the data at compute time, have the data already there FAILURE: The system must support partial failure If a component of the system fails: Do not require a full restart of the entire system CONSISTENCY: Component failures during execution of a job should not affect the outcome of the job SCALABILITY Increasing resources should support a proportional increase in load capacity 54
55 MapReduce To meet the challenges: MapReduce Programming Model introduced by Google in the early 2000s to support distributed computing on large data sets on clusters of computers Applications are written in a high-level programming model: developers do not worry about network programming, temporal dependencies, etc. This work takes a radically new approach to the problem of distributed computing Distribute the data as it is initially stored in the system Data is replicated multiple times on the system for increased availability and reliability 55
56 MapReduce Impact? bringing commodity big data processing to a broad audience in the same way the commodity LAMP stack changed the landscape of web applications to WEB
57 MapReduce: Very high level overview The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes. Solves the issue of data too large to fit onto a single machine Distributed computing over many servers Batch processing model Two phases Map phase, input data is processed, item by item, and transformed into an intermediate data set. Reduce phase, these intermediate results are reduced to a summarized data set, which is the desired end result. 57
58 MapReduce: Very high level overview process data in a batch-oriented fashion and may take minutes or hours to process (normally). Source: TDWI.org 58
59 MapReduce: Very high level overview Three distinct operations: Loading the data This operation is more properly called Extract, Transform, Load (ETL) in data warehousing terminology. Data must be extracted from its source, structured to make it ready for processing, and loaded into the storage layer for MapReduce to operate on it. MapReduce This phase will retrieve data from storage, process it (map, collect and sort map results, reduce) and return the results to the storage. Extracting the result Once processing is complete, for the result to be useful, it must be retrieved from the storage and presented. 59
60 MapReduce: Very high level overview MapReduce Programming Model Data type: key-value records Map function: (K_in, V_in) → list(K_inter, V_inter) Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out) 60
61 MapReduce Map step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Map(k1, v1) → list(k2, v2) Reduce step: The master node then takes the answers to all the sub-problems and combines them in a way to get the output - the answer to the problem it was originally trying to solve. Reduce(k2, list(v2)) → list(v3) Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 61
62 MapReduce: Very high level overview Programming Model map / reduce functions Suitable for embarrassingly parallel problems. Distributed Computing Framework Clusters of commodity hardware Large datasets Fault tolerant Splits jobs into a number of smaller tasks Move code to data (local computation) Allow programs to scale transparently with input size Abstract away fault tolerance, synchronization, 62
63 MapReduce By providing a data-parallel programming model, MapReduce can control job execution in useful ways: Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures & stragglers User focuses on application, not on complexities of distributed computing 63
64 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with hadoop 64
65 Example: Word Count Assume you have a cluster of 50 computers, each with an attached local disk that is half full of web pages. What is a simple parallel programming framework that would support the computation of word counts? 65
66 Example: Word Count Basic Pattern: Strings 1. Extract words from web pages in parallel. 2. Hash and sort words. 3. Count in parallel. 66
67 MapReduce example Input is files with one document per record User specifies map function Input of map: it was the best of times Output of map: it, 1 was, 1 the, 1 best, 1 67
68 MapReduce example (cont) MapReduce library gathers together all pairs with the same key value (shuffle/sort phase) The user-defined reduce function combines all the values associated with the same key Ex: Input of reduce: key = it, values = 1, 1; key = was, values = 1, 1; key = best, values = 1; key = worst, values = 1 Output of reduce: it, 2 was, 2 best, 1 worst, 1 68
69 E.g. Common wordcount Hello World Hello MapReduce Fig1: Sample input Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 69
70 E.g. Common wordcount Hello World Hello MapReduce Input MAP Hello, 1 World, 1 First intermediate output Hello, 1 MapReduce, 1 REDUCE Hello, 2 MapReduce, 1 World, 1 Final output Second intermediate output Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 70
71 E.g. Common wordcount void map(string i, string line): for word in line: print word, 1 Fig 2: wordcount map function Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez March 2012
72 E.g. Common wordcount void reduce(string word, list partial_counts): total = 0 for c in partial_counts: total += c print word, total Fig 3: wordcount reduce function Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 72
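The two pseudocode functions above, plus the shuffle/sort step that runs between them, can be sketched as a single-process Python program (the function names and the driver are illustrative, not part of any Hadoop API):

```python
from collections import defaultdict

def map_fn(line):
    # Mapper: emit (word, 1) for each word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, partial_counts):
    # Reducer: sum all partial counts for one key.
    return (word, sum(partial_counts))

def mapreduce(lines):
    # Shuffle/sort: group all intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

print(mapreduce(["Hello World", "Hello MapReduce"]))
# {'Hello': 2, 'MapReduce': 1, 'World': 1}
```

On a real cluster the mappers and reducers run on different nodes and the framework performs the grouping, but the data flow is exactly this.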
73 Why is word count example important? It is one of the most important examples for the type of text processing often done with MapReduce. There is an important mapping document < > data record words < > (field, value) 73
74 Other example applications Search: Input: (linenumber, line) records Output: lines matching a given pattern Map: if(line matches pattern): output(line) Reduce: identity function Alternative: no reducer (map-only job) 74
75 Other example applications Inverted index: Input: (filename, text) records Output: list of files containing each word Map: foreach word in text.split(): output(word, filename) Reduce: def reduce(word, filenames): output(word, sort(filenames)) 75
76 Other example applications Inverted index: hamlet.txt to be or not to be 12th.txt be not afraid of greatness to, hamlet.txt be, hamlet.txt or, hamlet.txt not, hamlet.txt be, 12th.txt not, 12th.txt afraid, 12th.txt of, 12th.txt greatness, 12th.txt (sort) afraid, (12th.txt) be, (12th.txt, hamlet.txt) greatness, (12th.txt) not, (12th.txt, hamlet.txt) of, (12th.txt) or, (hamlet.txt) to, (hamlet.txt) 76
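The same inverted-index job can be sketched in plain Python using the slide's two input files (a single-process illustration, not Hadoop code):

```python
from collections import defaultdict

# The slide's example input: filename -> contents.
docs = {"hamlet.txt": "to be or not to be",
        "12th.txt": "be not afraid of greatness"}

# Map phase: emit (word, filename); the dict-of-sets plays the shuffle role.
index = defaultdict(set)
for filename, text in docs.items():
    for word in text.split():
        index[word].add(filename)

# Reduce phase: for each word, output the sorted list of files containing it.
for word in sorted(index):
    print(word, sorted(index[word]))
# e.g. be ['12th.txt', 'hamlet.txt'] ... to ['hamlet.txt']
```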
77 When to use MapReduce? Does the problem I am trying to solve decompose into Map and Reduce operation? MapReduce works on any problem that is made up of exactly 2 functions at some level of abstraction: Map: Execute the same operation on all data in the input set Reduce: Execute the same operation on each group of data produced by Map There are a class of algorithms that cannot be efficiently implemented with the MapReduce programming model 77
78 Programming for distributed/parallel systems Main challenges and bottlenecks: Data exchange requires synchronization Bandwidth limitations Temporal dependencies Failures (of the systems performed by commodity hardware) Dealing with multiple parallel computing resources and distributed data resources Different programming models deal with different challenges 78
79 Programming for distributed/parallel systems If the main challenge is to deal with partial failures: MapReduce programming model MapReduce allows easy programming with the expectation of failure Example: assume that you are searching a cluster of servers and one is unable to respond at that moment. Since MapReduce could not reach that node, it will simply reschedule its Map or Reduce task and run it later on another node. Essentially it tries to guarantee that all results are produced despite the unpredictability of software and hardware in these environments. 79
80 Programming for distributed/parallel systems Main challenges and bottlenecks: Data exchange requires synchronization Bandwidth limitations Temporal dependencies Failures (of the systems performed by commodity hardware) Dealing with multiple parallel computing resources and distributed data resources Different programming models deal with different challenges 80
81 Programming for distributed/parallel systems If the main challenge is to deal with dependencies: Use other programming models E.g. OmpSs/COMPSs programming model Reduces the complexity of programming parallel applications on complex platforms (multicores/GPUs, distributed computing, Clouds) Based on traditional programming languages (C/C++, Java, Fortran) and sequential programming Task based Intelligent runtime Builds a task dependence graph based on directionality hints given by the programmer Performs scheduling and resource management, exploiting potential parallelism Automatic data transfers, exploiting data locality 81
82 OmpSs/COMPSs vs MapReduce [Diagram: in OmpSs/COMPSs, input data flows to output data through an arbitrary graph of compute nodes; in MapReduce, input data flows through mappers, then reducers, to output data] 82
83 OmpSs/COMPSs vs MapReduce
Data: OmpSs/COMPSs: arbitrary structure / MapReduce: (key, value) pairs
Functions: OmpSs/COMPSs: arbitrary / MapReduce: Map & Reduce
Middleware: OmpSs/COMPSs: BSC middleware / MapReduce: Hadoop
Ease of use: OmpSs/COMPSs: Low / MapReduce: Medium
Scope: OmpSs/COMPSs: Wide / MapReduce: Narrow
Graph structure: OmpSs/COMPSs: Dynamic Directed Acyclic Graph / MapReduce: Two-level Inverted Tree
83
84 Conclusions MapReduce programming model hides the complexity of work distribution and fault tolerance Principal design philosophies: Make it scalable, so you can throw hardware at problems Make it cheap, lowering hardware, programming and admin costs MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time Cloud computing makes it straightforward to start using Hadoop (or other parallel software) at scale 84
85 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with hadoop 85
86 Hadoop MapReduce Hadoop is the dominant open source MapReduce implementation Funded by Yahoo, it emerged in 2006 The Hadoop project is now hosted by Apache Implemented in Java (the data to be processed must be loaded into, e.g., the Hadoop Distributed Filesystem) Source: Wikipedia 86
87 Hadoop MapReduce Hadoop is an open source MapReduce runtime provided by the Apache Software Foundation De-facto standard, free, open-source MapReduce implementation. Endorsed by: 87
88 Hadoop - Architecture Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 88
89 Hadoop: Very high-level overview When data is loaded into the system, it is split into blocks of 64 MB/128 MB A Map task typically works on a single block A master program allocates work to nodes (that work in parallel) such that a Map task will work on a block of data stored locally on that node If a node fails, the master will detect that failure and re-assign the work to a different node in the system 89
90 Hadoop essentials Computation: Move the computation to the data Storage: Keeping track of the data and metadata Data is sharded across the cluster Cluster management tools... 90
91 (default) Hadoop's Stack Applications more detail in next part!!! Compute Services Data Services Storage Services Hadoop's MapReduce HBase: NoSQL Database Hadoop Distributed File System (HDFS) Resource Fabrics 91
92 Basic Cluster Components One of each: Namenode (NN) Jobtracker (JT) Set of each per slave machine: Tasktracker (TT) Datanode (DN) 92
93 Putting Everything Together [Diagram: a namenode (running the namenode daemon) and a job submission node (running the jobtracker), plus several slave nodes, each running a tasktracker and a datanode daemon on top of the Linux file system] 93
94 Anatomy of a Job MapReduce program in Hadoop = Hadoop job Jobs are divided into map and reduce tasks An instance of running a task is called a task attempt Multiple jobs can be composed into a workflow 94
95 Anatomy of a Job Job submission process Client (i.e., driver program) creates a job, configures it, and submits it to job tracker JobClient computes input splits (on client end) Job data (jar, configuration XML) are sent to JobTracker JobTracker puts job data in shared location, enqueues tasks TaskTrackers poll for tasks Off to the races 95
96 Running MapReduce job with Hadoop Steps: Defining the MapReduce stages in a Java program Loading the data into the Hadoop Distributed Filesystem Submitting the job for execution Retrieving the results from the filesystem MapReduce has been implemented in a variety of other programming languages and systems, Several NoSQL database systems have integrated MapReduce (later in this course) 96
97 Is Hadoop really that big a deal? Yes. According to a survey (*) from July 2011: 54% of organizations surveyed are using or considering Hadoop Over 82% of users benefit from faster analyses and better utilization of computing resources 94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data Organizations use Hadoop in particular to work with unstructured data such as logs and event data (63%) (*) Research-Survey-Shows-Organizations-Hadoop-Perform 97
98 Hadoop and the enterprise? Hadoop is a complement to a relational data warehouse Enterprises are generally not replacing their relational data warehouse with Hadoop Hadoop's strengths Inexpensive High reliability Extreme scalability Flexibility: data can be added without defining a schema Hadoop's weaknesses Hadoop is not an interactive query environment Processing data in Hadoop requires writing code 98
99 Using MapReduce in the Enterprise 99
100 Who is using Hadoop? eBay is using Hadoop for search optimization and research via a huge cluster. Facebook has two big Hadoop clusters for storing internal log and data sources and for data reporting, analytics and machine learning. Twitter uses Hadoop to store and process all its tweets and other data types generated on the social networking system. Yahoo has a Hadoop cluster of 4,500 nodes for research efforts around its ad systems and Web servers. It's also using it for scaling tests to drive Hadoop development on bigger clusters. Source: Enterprise/ 100
101 Who is using Hadoop? Source: Wikipedia, April
102 What is MapReduce model used for? At Google: Index construction for Google Search Article clustering for Google News Statistical machine translation At Yahoo!: Web map powering Yahoo! Search Spam detection for Yahoo! Mail At Facebook: Data mining Ad optimization Spam detection 102
103 Hadoop 1.0: The Apache Software Foundation delivers Hadoop 1.0, the much-anticipated 1.0 version of the popular open-source platform for storing and processing large amounts of data. It is the result of six years of development, production experience, extensive testing, and feedback from hundreds of knowledgeable users, data scientists and systems engineers, culminating in a highly stable, enterprise-ready release of the fastest-growing big data platform. 103
104 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with hadoop 104
105 Amazon Elastic MapReduce Provides a web-based interface and command-line tools for running Hadoop jobs on Amazon EC2 Data stored in Amazon S3 Monitors the job and shuts down machines after use Small extra charge on top of EC2 pricing If you want more control over how your Hadoop runs, you can launch a Hadoop cluster on EC2 manually using the scripts in src/contrib/ec2 105
106 Amazon Elastic MapReduce 106
107 Elastic MapReduce Workflow 107
108 Elastic MapReduce Workflow 108
109 Elastic MapReduce Workflow 109
110 Elastic MapReduce Workflow 110
111 Content Big Data Challenges Big Data Storage MapReduce architecture When to use MapReduce? Hadoop MapReduce in the Cloud (AWS) Getting started with hadoop and Homework 3! 111
112 Getting Started with Hadoop 112
113 Getting Started with Hadoop Different ways to write jobs: Java API Hadoop Streaming (for Python, Perl, etc) Pipes API (C++) R 113
114 Hadoop API Different APIs to write Hadoop programs: A rich Java API (main way to write Hadoop programs) A Streaming API that can be used to write map and reduce functions in any programming language (using standard inputs and outputs) A C++ API (Hadoop Pipes) Higher-level languages (e.g., Pig, Hive) 114
115 Hadoop API Mapper void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) void configure(JobConf job) void close() throws IOException Reducer/Combiner void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) void configure(JobConf job) void close() throws IOException Partitioner int getPartition(K2 key, V2 value, int numPartitions) 115
116 WordCount.java package org.myorg; import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.conf.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; import org.apache.hadoop.util.*; public class WordCount { } 116
117 WordCount.java public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, one); } } } 117
118 WordCount.java public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } 118
119 WordCount.java public static void main(String[] args) throws Exception { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(Map.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); } 119
120 E.g. Common wordcount Hello World Hello MapReduce Fig1: Sample input Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 120
121 E.g. Common wordcount void map(string i, string line): for word in line: print word, 1 Fig 2: wordcount map function Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez March 2012
122 E.g. Common wordcount void reduce(string word, list partial_counts): total = 0 for c in partial_counts: total += c print word, total Fig 3: wordcount reduce function Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 122
123 E.g. Common wordcount Hello World Hello MapReduce Input MAP Hello, 1 World, 1 First intermediate output Hello, 1 MapReduce, 1 REDUCE Hello, 2 MapReduce, 1 World, 1 Final output Second intermediate output Source: HADOOP: presentation at EEDC 2012 seminars by Juan Luis Pérez 123
124 Word Count Python Mapper
import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print '%s%s%d' % (word, separator, 1)
Source: Robert Grossman Tutorial Supercomputing 2011
125 Word Count R Mapper
trimwhitespace <- function(line) gsub("(^ +)|( +$)", "", line)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimwhitespace(line)
    words <- splitintowords(line)
    cat(paste(words, "\t1\n", sep=""), sep="")
}
close(con)
Source: Robert Grossman Tutorial Supercomputing 2011
126 Word Count Java Mapper public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } } Source: Robert Grossman Tutorial Supercomputing 2011
127 Code Comparison Word Count Mapper [The Python, Java and R word count mappers from the three previous slides, shown side by side] Source: Robert Grossman Tutorial Supercomputing 2011
128 Word Count Python Reducer
import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)
Source: Robert Grossman Tutorial Supercomputing 2011
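To try the streaming mapper and reducer logic without a cluster, the stdin → mapper → sort → reducer contract can be simulated in-process (an illustrative harness; run_pipeline is not a Hadoop Streaming API):

```python
from itertools import groupby
from operator import itemgetter

def run_pipeline(text, separator='\t'):
    """Simulate Hadoop Streaming: map each line, sort by key, then reduce."""
    mapped = [word + separator + "1"                      # mapper output
              for line in text.splitlines()
              for word in line.split()]
    mapped.sort()                                         # shuffle/sort step
    counts = {}
    pairs = (line.split(separator, 1) for line in mapped)  # reducer input
    for word, group in groupby(pairs, itemgetter(0)):
        counts[word] = sum(int(count) for _, count in group)
    return counts

print(run_pipeline("Hello World\nHello MapReduce"))
# {'Hello': 2, 'MapReduce': 1, 'World': 1}
```

On a real cluster the same mapper and reducer scripts are launched by Hadoop Streaming, which supplies the sorting and grouping between the two phases.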
129 Word Count R Reducer
trimwhitespace <- function(line) gsub("(^ +)|( +$)", "", line)
splitline <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimwhitespace(line)
    split <- splitline(line)
    word <- split$word
    count <- split$count
Source: Robert Grossman Tutorial Supercomputing 2011
130 Word Count R Reducer (cont'd)
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    } else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
131 Word Count Java Reducer

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
132 Code Comparison Word Count Reducer

Python

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

Java

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

R

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldCount <- get(word, envir = env)
        assign(word, oldCount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Source: Robert Grossman Tutorial, Supercomputing 2011
133 HOMEWORK 3: Groups of 2 or 3 students
Vowels Count program (*)
You can use:
- Amazon Elastic MapReduce
- your own local Hadoop installation
Presentation day: Monday 02/12/2013
One or two groups will be randomly chosen
Slides (or web page): hands-on style
(*) we could agree on another program
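One way to approach the Vowels Count assignment with Hadoop Streaming is to reuse the word-count skeleton, emitting one record per vowel instead of per word; the reducer can stay exactly as in the word-count example. A hypothetical sketch (the function name and vowel set are my own choices, not part of the assignment):

```python
import sys

VOWELS = set('aeiou')

def map_vowels(lines, separator='\t'):
    # Emit "vowel<TAB>1" for every vowel character in the input lines.
    for line in lines:
        for ch in line.lower():
            if ch in VOWELS:
                yield '%s%s%d' % (ch, separator, 1)

if __name__ == '__main__':
    # As a streaming mapper: read stdin, write one record per vowel.
    for record in map_vowels(sys.stdin):
        print(record)
```

The shuffle then groups the records by vowel, and the unchanged word-count reducer sums the 1s per key.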
134 Some Resources
- Hadoop: The Definitive Guide. Tom White. O'Reilly, 2012
- Hadoop: Video tutorials from Cloudera and MapR
- Amazon Elastic MapReduce guide: GettingStartedGuide/
- Slides from PARLab Parallel Boot Camp: "Cloud Computing with MapReduce and Hadoop" by Matei Zaharia, Electrical Engineering and Computer Sciences, University of California, Berkeley ootcamp_clouds/