INTRODUCTION TO HADOOP
- Mercy Parsons
- 6 years ago
1 Hadoop
2 INTRODUCTION TO HADOOP Distributed Systems + Middleware: Hadoop 2
3 Data We live in a digital world that produces data at an impressive speed As of 2012, 2.7 ZB of data existed (1 ZB = 10^21 bytes) The NYSE produces 1 TB of data per day The Internet Archive grows by 20 TB per month The LHC produces 15 PB of data per year AT&T has a 300 TB database 100 TB of data are uploaded daily to Facebook
4 Data Personal data is growing, too E.g., photos: a single photo taken with a commercial Nikon camera takes about 6 MB (default settings); a year of family photos takes about 8 GB of space, adding to the personal content uploaded to social networks, video websites, blogs and more Machine-produced data is also growing Machine logs Sensor networks and monitored data
5 Data analysis Main problem: disk transfer speed has not kept pace with disk capacity Solution: parallelize the storage and read less data from each disk Problems: hardware replication, data aggregation Take, for example, an RDBMS (keeping in mind that disk seek time dominates the latency of its operations): Updating records is fast: a B-Tree structure is efficient Reading many records is slow: if access is dominated by seek time, it is faster to read the entire disk (which operates at transfer rate)
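The arithmetic behind this trade-off can be sketched with a back-of-the-envelope estimate (the 100 MB/s transfer rate and the 100-disk cluster are illustrative assumptions, not figures from the slides):

```java
public class DiskReadEstimate {
    // Seconds needed to stream `totalMB` megabytes off `disks` disks in parallel,
    // each sustaining `mbPerSecond` of sequential transfer.
    static long secondsToRead(long totalMB, long mbPerSecond, int disks) {
        return totalMB / (mbPerSecond * disks);
    }

    public static void main(String[] args) {
        // 1 TB from a single disk at an assumed 100 MB/s: 10,000 s (almost 3 hours)
        System.out.println(secondsToRead(1_000_000, 100, 1));
        // The same 1 TB spread over 100 disks read in parallel: 100 s
        System.out.println(secondsToRead(1_000_000, 100, 100));
    }
}
```

This is exactly the trade Hadoop makes: more spindles each reading less data, at the cost of the replication and aggregation machinery discussed above.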
6 Hadoop Reliable data storage: Hadoop Distributed File System Data analysis: MapReduce implementation Many tools for the developers Easy cluster administration Query languages Some are similar to SQL Column-oriented distributed databases on top of Hadoop Structured to unstructured repositories and back
7 Hadoop vs. The (existing) World RDBMS: limited by disk seek time Some types of data are not normalized (e.g., logs): MapReduce works well with unstructured data MapReduce scales linearly (while an RDBMS does not) Volunteer computing (e.g., SETI@home) Similar model, but Hadoop works in a localized cluster sharing high-performance bandwidth, while volunteer computing works over the Internet on untrusted computers that perform other operations in the meantime
8 Hadoop vs. The (existing) World MPI: Works well for compute-intensive jobs, but the network becomes the bottleneck when hundreds of GB of data have to be analyzed Conversely, MapReduce does its best to exploit data locality by collocating the data with the compute node (network bandwidth is the most precious resource, it must not be wasted) MapReduce operates at a higher level than MPI: the data flow is already taken care of MapReduce implements failure recovery (in MPI the developer has to handle checkpoints and failure recovery)
9 HADOOP HISTORY
10 Brief history In 2002, Mike Cafarella and Doug Cutting started working on Apache Nutch, a new Web search engine In 2003, Google published a paper on the Google File System, a distributed filesystem, and Mike and Doug started working on a similar, open source, project In 2004, Google published another paper, on the MapReduce computation model, and yet again Mike and Doug implemented an open source version in Nutch
11 Brief history In 2006, these two projects separated from Nutch and became Hadoop In the same year, Doug Cutting started working for Yahoo! and started using Hadoop there In 2008 Hadoop was used by Yahoo! (10000-core cluster), Last.fm, Facebook and the NYT In 2009, Yahoo! broke the world record by sorting 1 TB of data in 62 seconds, using Hadoop Since then, Hadoop has become mainstream in industry
12 Examples from the Real World Last.fm Each user listening to a song (locally or in streaming) generates a trace Hadoop analyses these traces to produce charts Facebook E.g., track statistics per user and per country, weekly top tracks Daily and hourly summaries over user logs Product usage, ad campaigns Ad-hoc jobs over historical data Long-term archival store Integrity checks
13 Examples from the Real World Nutch search engine Link inversion: find the links that point to a specific Web page URL fetching Produce Lucene indexes (for text searches) Infochimps: explore network graphs Social networks: Twitter analysis, measure communities Biology: neuron connections in roundworms Street connections: OpenStreetMap
14 Hadoop umbrella HDFS: distributed filesystem MapReduce: distributed data processing model MRUnit: unit testing of MapReduce applications Pig: data flow language to explore large datasets Hive: distributed data warehouse HBase: distributed, column-oriented database ZooKeeper: distributed coordination service Sqoop: efficient bulk transfers of data between HDFS and structured datastores
15 MAPREDUCE BY EXAMPLE
16 The Word Count Example Count the number of times each word occurs in a set of documents. Example: one document with one sentence Do as I say, not as I do
17 A possible solution A Multiset is a set where each element also has a count. define wordcount as Multiset; for each document in documentset { T = tokenize(document); for each token in T { wordcount[token]++; } } display(wordcount);
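The pseudocode above can be turned into a small self-contained Java sketch; `tokenize` is assumed here to mean lower-casing and splitting on non-letter characters:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountMultiset {
    // The Multiset from the pseudocode: a map from each token to its count.
    static Map<String, Integer> count(List<String> documentSet) {
        Map<String, Integer> wordcount = new HashMap<>();
        for (String document : documentSet) {
            // tokenize: lower-case and split on anything that is not a letter
            for (String token : document.toLowerCase().split("[^a-z]+")) {
                if (!token.isEmpty()) wordcount.merge(token, 1, Integer::sum);
            }
        }
        return wordcount;
    }

    public static void main(String[] args) {
        // The single-sentence document from the slides:
        // "do" and "as" each appear twice
        System.out.println(count(List.of("Do as I say, not as I do")));
    }
}
```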
18 Problems with our solution This program works fine until the set of documents you want to process becomes large. E.g., a spam filter needs to know the words frequently used in the millions of spam e-mails you receive. Looping through the documents using a single computer will be extremely time-consuming Possible alternative solution: speed it up by rewriting the program so that it distributes the work over several machines. Each machine will process a distinct fraction of the documents. When all the machines have completed this, a second phase of processing will combine the results of all the machines.
19 Possible re-write define wordcount as Multiset; for each document in documentsubset { T = tokenize(document); for each token in T { wordcount[token]++; } } sendToSecondPhase(wordcount); define totalwordcount as Multiset; for each wordcount received from firstphase { multisetAdd(totalwordcount, wordcount); }
20 Possible Problems We ignore the performance requirement of reading in the documents. If the documents are all stored in one central storage server, then the bottleneck is in the bandwidth of that server. We need to split up the documents among the set of processing machines such that each machine will process only those documents that are stored in it. Storage and processing have to be tightly coupled in data-intensive distributed applications wordcount (and totalwordcount) are stored in memory. When processing large document sets, the number of unique words can exceed the RAM of a machine We need to rewrite our program to store this hash table on disk (lots of code)
21 Even more problems Phase two has only one machine, which will process wordcount sent from all the machines in phase one After we have added enough machines to phase one processing, the single machine in phase two will become the bottleneck We need to rewrite phase two in a distributed fashion so that it can scale by adding more machines
22 Final Solution [Diagram: each of the n document subsets is processed by a phase-1 machine that produces per-letter word counts (a through z); a reshuffle step routes each letter's counts to one of 26 phase-2 machines (A through Z), one per letter of the alphabet, each maintaining one of 26 disk-based hash tables for wordcount]
23 Considerations Starts getting complex Requirements Store files over many processing machines (of phase one). Write a disk-based hash table permitting processing without being limited by RAM capacity. Partition the intermediate data (that is, wordcount) from phase one. Shuffle the partitions to the appropriate machines in phase two. And we're still not dealing with possible failures!!!
24 MapReduce Model for analyzing large amounts of data Unstructured data is organized as key/value pairs and lists Two phases: map(k1, v1) -> list(k2, v2), where the input domain is different from the output domain filter and transform (shuffle: intermediate phase to sort the output of map and group by key) default implementations reduce(k2, list(v2)) -> list(v3), where the input and output domains are the same aggregate
25 Examples of inputs <k1, v1> Multiple files list(<String filename, String file_content>) One large log file list(<Integer line_number, String log_event>) Lists are broken up and each individual pair is processed by the map function Each input becomes a list(<k2, v2>)
26 WordCount in MapReduce Mapper Input <String filename, String file_content> Ignores filename Mapper Output list of <String word, Integer count> (e.g., <"foo", 3>) or list of <String word, Integer 1> (e.g., <"foo", 1>) With repeated entries Easier to program All pairs sharing the same k2 are grouped They form a <k2, list(v2)> Aggregation by Reducer Two mappers produce <"foo", list(1,1)> and <"foo", list(1,1,1)> The aggregated pair the reducer sees is <"foo", list(1,1,1,1,1)>. Reducer produces <"foo", 5>
27 How would we write this? map(String filename, String document) { List<String> T = tokenize(document); for each token in T { emit((String)token, (Integer)1); } } reduce(String token, List<Integer> values) { Integer sum = 0; for each value in values { sum = sum + value; } emit((String)token, (Integer)sum); }
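The same pseudocode can be simulated end to end in plain Java, with the shuffle step made explicit; this is a local sketch of the model, not the Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    // map(k1, v1) -> list(k2, v2): emit (token, 1) for every token in the document
    static List<Map.Entry<String, Integer>> map(String filename, String document) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : document.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) out.add(Map.entry(token, 1));
        }
        return out;
    }

    // shuffle: sort the map output and group values by key -> <k2, list(v2)>
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce(k2, list(v2)) -> (k2, sum): aggregate the grouped values
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((token, values) ->
            result.put(token, values.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(reduce(shuffle(map("doc1", "Do as I say, not as I do"))));
        // prints {as=2, do=2, i=2, not=1, say=1}
    }
}
```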
28 MOVING TO HADOOP
29 Building Blocks (daemons) On a fully configured cluster, running Hadoop means running multiple daemons NameNode DataNode Secondary NameNode JobTracker TaskTracker
30 NameNode Hadoop uses a master/slave configuration both for distributed storage and distributed computation The distributed storage is called HDFS NameNode is the master of HDFS It directs slave DataNodes to perform low-level I/O Keeps track of how files are broken down into file-blocks, which nodes store those blocks, and the overall health of the distributed filesystem The function of the NameNode is memory and I/O intensive. As such, the server hosting the NameNode typically doesn't store any user data or perform any computations Single point of failure!!!
31 DataNode Each slave machine in the cluster will host a DataNode daemon for reading and writing HDFS blocks to actual files on the local filesystem Files are broken into blocks and the NameNode tells a client which DataNode each block resides on The clients communicate directly with the DataNode daemons to process the local files DataNodes may replicate data blocks for redundancy.
32 HDFS Example
33 Secondary NameNode The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster's HDFS Each cluster has one SNN, and it typically resides on its own machine The SNN differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration. Does not solve the single point of failure Still requires human intervention if the NameNode fails
34 JobTracker There is only one JobTracker daemon per Hadoop cluster. It's typically run on a server as a master node of the cluster. Once you submit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.
35 TaskTracker TaskTrackers manage the execution of individual tasks on each slave node Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel. TaskTrackers constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster
36 Job Submission
37 Summary of Architecture
38 USING HADOOP
39 HDFS HDFS is a filesystem designed for large-scale distributed data processing It is possible to store a big data set of (say) 100 TB as a single file in HDFS HDFS abstracts details away and gives the illusion that we're dealing with a single file In a typical Hadoop workflow files are created elsewhere and copied into HDFS using command line utilities MapReduce programs process this data, but they don't read/write HDFS files directly
40 HDFS Command-line Utilities
hdfs dfs -mkdir /user/chuck
hdfs dfs -ls /
hdfs dfs -ls -R /
hdfs dfs -put example.txt /
hdfs dfs -get /example.txt .
hdfs dfs -cat /example.txt
hdfs dfs -rm /example.txt
41 Anatomy of a Hadoop Application
42 Data Types The MapReduce framework has a certain defined way of serializing the key/value pairs to move them across the cluster's network Only classes that support this kind of serialization can function as keys or values in the framework. Classes that implement the Writable interface can be values Classes that implement the WritableComparable<T> interface can be either keys or values
43 Predefined Types
44 Mapper To serve as the mapper, a class implements the Mapper interface and inherits the MapReduceBase class. The MapReduceBase class serves as the base class for both mappers and reducers. It includes two methods that effectively act as the constructor and destructor for the class: void configure(JobConf job) In this function you can extract the parameters set either by the configuration XML files or in the main class of your application. void close() As the last action before the map task terminates, this function should wrap up any loose ends
45 Mapper The Mapper interface is responsible for the data processing step. It utilizes Java generics of the form Mapper<K1,V1,K2,V2> where the key classes and value classes implement the WritableComparable and Writable interfaces. One method to process an individual (key/value) pair: void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter) throws IOException
46 Predefined Mappers
47 Reducer void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter) throws IOException When the reducer task receives the output from the various mappers, it sorts the incoming data on the key of the (key/value) pair and groups together all values of the same key. The reduce() function is then called, and it generates a (possibly empty) list of (K3, V3) pairs by iterating over the values associated with a given key.
48 Predefined Reducers
49 Partitioner With multiple reducers, we need some way to determine the appropriate one to which to send a (key/value) pair output by a mapper. The default behavior is to hash the key to determine the reducer. We can define application-specific partitioners by implementing the Partitioner interface
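The default hash partitioning can be sketched in a few lines; the sign-bit masking mirrors what Hadoop's stock HashPartitioner does, but treat the details as an assumption of this sketch:

```java
public class HashPartitionerSketch {
    // Pick a reducer for a key: hash it, clear the sign bit so the result
    // is non-negative, then take it modulo the number of reducers.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every occurrence of the same key maps to the same reducer,
        // so all of its values end up grouped on one machine.
        System.out.println(getPartition("foo", 4));
        System.out.println(getPartition("foo", 4));
    }
}
```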
50 Combiner (or local reduce) In many situations with MapReduce applications, we may wish to perform a local reduce before we distribute the mapper results. Send 1 <word, 574> pair instead of 574 <word, 1> pairs In the figure, the shapes represent keys, the inner patterns represent values.
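The saving a combiner buys can be shown with a short sketch: it folds a mapper's repeated <word, 1> pairs into one <word, n> pair before anything crosses the network:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Local reduce over one mapper's output: sum the values of equal keys.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // A mapper that saw "word" 574 times emits 574 <word, 1> pairs...
        List<Map.Entry<String, Integer>> mapperOutput = new ArrayList<>();
        for (int i = 0; i < 574; i++) mapperOutput.add(Map.entry("word", 1));
        // ...which the combiner collapses into the single pair <word, 574>.
        System.out.println(combine(mapperOutput)); // prints {word=574}
    }
}
```

Word count can use its reducer as a combiner because summing is associative; not every reduce function has this property.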
51 Reading and Writing to HDFS Input data usually resides in large files, typically tens or hundreds of gigabytes or even more. One of the fundamental principles of MapReduce's processing power is the splitting of the input data into splits Reads are done through FSDataInputStream FSDataInputStream extends DataInputStream with random read access MapReduce requires this because a machine may be assigned to process a split that sits right in the middle of an input file.
52 InputFormat The way an input file is split up and read by Hadoop is defined by one of the implementations of the InputFormat interface. TextInputFormat is the default InputFormat implementation The key returned by TextInputFormat is the byte offset of each line One can create their own InputFormat public interface InputFormat<K, V> { InputSplit[] getSplits(JobConf job, int numSplits) throws IOException; RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException; }
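What TextInputFormat's record reader produces can be imitated in plain Java: each record's key is the byte offset at which the line starts and the value is the line itself (this sketch assumes single-byte characters and \n line endings):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineOffsetSketch {
    // Turn file contents into TextInputFormat-style records:
    // key = byte offset of the line's first character, value = the line.
    static Map<Long, String> toRecords(String fileContents) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n", -1)) {
            records.put(offset, line);
            offset += line.length() + 1; // +1 for the newline (single-byte chars assumed)
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(toRecords("first line\nsecond line\nthird line"));
        // prints {0=first line, 11=second line, 23=third line}
    }
}
```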
53 Common InputFormats
54 Output Formats The default OutputFormat is TextOutputFormat, which writes each record as a line of text. Each record's key and value are converted to strings through toString(), and a tab (\t) character separates them. The separator character can be changed in the mapred.textoutputformat.separator property. TextOutputFormat outputs data in a format readable by KeyValueTextInputFormat.
55 Common Output Formats
56 THE WORDCOUNT EXAMPLE
57 WordCount 2.0
public class WordCount2 {
  public static void main(String[] args) {
    JobClient client = new JobClient();
    JobConf conf = new JobConf(WordCount2.class);
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    conf.setMapperClass(TokenCountMapper.class);
    conf.setCombinerClass(LongSumReducer.class);
    conf.setReducerClass(LongSumReducer.class);
    client.setConf(conf);
    JobClient.runJob(conf);
  }
}
58 CONFIGURING HADOOP
59 Different Running Modes Hadoop can be run in three different modes Local mode The default mode for Hadoop Hadoop will run completely on the local machine The standalone mode doesn't use HDFS and does not launch any of the Hadoop daemons Pseudo-distributed mode Running Hadoop in a "cluster of one": all daemons run on a single machine and communicate through SSH Fully-distributed mode Deployed to multiple machines
60 Pseudo-distributed mode core-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.</description> </property> </configuration>
61 Pseudo-distributed mode mapred-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>mapred.job.tracker</name> <value>localhost:9001</value> <description>The host and port that the MapReduce job tracker runs at.</description> </property> </configuration>
62 Pseudo-distributed mode hdfs-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>dfs.replication</name> <value>1</value> <description>The actual number of replications can be specified when the file is created.</description> </property> </configuration>
63 Masters and Slaves files
$ cat masters
localhost
$ cat slaves
localhost
64 Pseudo-distributed mode Check SSH: ssh localhost If not configured: ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys Format HDFS: hdfs namenode -format
65 What's running
[hadoop-user@master]$ jps
Jps
TaskTracker
SecondaryNameNode
NameNode
DataNode
JobTracker
66 Fully-distributed mode core-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>fs.default.name</name> <value>hdfs://master:9000</value> <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.</description> </property> </configuration>
67 Fully-distributed mode mapred-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>mapred.job.tracker</name> <value>master:9001</value> <description>The host and port that the MapReduce job tracker runs at.</description> </property> </configuration>
68 Fully-distributed mode hdfs-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>dfs.replication</name> <value>3</value> <description>The actual number of replications can be specified when the file is created.</description> </property> </configuration>
69 Masters and Slaves files
$ cat masters
backup
$ cat slaves
hadoop1
hadoop2
hadoop3
70 What's running?
[hadoop-user@master]$ jps
JobTracker
NameNode
Jps
[hadoop-user@backup]$ jps
2099 Jps
1679 SecondaryNameNode
[hadoop-user@hadoop1]$ jps
7101 TaskTracker
7617 Jps
6988 DataNode
71 PATENT EXAMPLE
72 Two data sources Patent citation data contains citations from U.S. patents issued between 1975 and 1999 It has more than 16 million rows of "CITING","CITED" pairs Patent description data has the patent number, the patent application year, the patent grant year, the number of claims, and other metadata Its columns are "PATENT","GYEAR","GDATE","APPYEAR","COUNTRY","POSTATE","ASSIGNEE","ASSCODE","CLAIMS","NCLASS","CAT","SUBCAT","CMADE","CRECEIVE","RATIOCIT","GENERAL","ORIGINAL","FWDAPLAG","BCKGTLAG","SELFCTUB","SELFCTLB","SECDUPBD","SECDLWBD" Sample rows: ...,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,, ...,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,, ...,1963,1096,,"US","IL",,1,,2,6,63,,9,,0.3704,,,,,,, ...,1963,1096,,"US","OH",,1,,2,6,63,,3,,0.6667,,,,,,, ...,1963,1096,,"US","CA",,1,,2,6,63,,1,,0,,,,,,,
73 What do citations look like?
74 For each patent find and group the patents that cite it INVERT THE DATA
75 Count the number of citations a patent has received COUNT CITATIONS
76 How many patents have been cited n times COUNT THE CITATION COUNTS
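The last two tasks can be sketched locally: first count citations per cited patent, then build a histogram of those counts. In Hadoop each step would be its own MapReduce job; the patent identifiers below are hypothetical placeholders, not rows from the real dataset:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class CitationCounts {
    // Step 1 (COUNT CITATIONS): from (citing, cited) pairs,
    // count how many citations each cited patent received.
    static Map<String, Integer> citationsPerPatent(String[][] citations) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] pair : citations) counts.merge(pair[1], 1, Integer::sum);
        return counts;
    }

    // Step 2 (COUNT THE CITATION COUNTS): how many patents were cited n times.
    static Map<Integer, Integer> countOfCounts(Map<String, Integer> perPatent) {
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (int c : perPatent.values()) histogram.merge(c, 1, Integer::sum);
        return histogram;
    }

    public static void main(String[] args) {
        // Hypothetical (citing, cited) pairs standing in for the 16M-row file
        String[][] citations = {
            {"p1", "p9"}, {"p2", "p9"}, {"p3", "p9"},
            {"p4", "p8"}, {"p5", "p8"}, {"p6", "p7"}
        };
        Map<String, Integer> perPatent = citationsPerPatent(citations);
        System.out.println(perPatent);            // p9 cited 3 times, p8 twice, p7 once
        System.out.println(countOfCounts(perPatent)); // prints {1=1, 2=1, 3=1}
    }
}
```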
77 SENSOR DATA
MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)
More informationData Analytics Job Guarantee Program
Data Analytics Job Guarantee Program 1. INSTALLATION OF VMWARE 2. MYSQL DATABASE 3. CORE JAVA 1.1 Types of Variable 1.2 Types of Datatype 1.3 Types of Modifiers 1.4 Types of constructors 1.5 Introduction
More informationIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -
More informationA brief history on Hadoop
Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)
More informationHadoop Quickstart. Table of contents
Table of contents 1 Purpose...2 2 Pre-requisites...2 2.1 Supported Platforms... 2 2.2 Required Software... 2 2.3 Installing Software...2 3 Download...2 4 Prepare to Start the Hadoop Cluster...3 5 Standalone
More informationInternational Journal of Advance Engineering and Research Development. A Study: Hadoop Framework
Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja
More informationHadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391
Hadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391 Outline Big Data Big Data Examples Challenges with traditional storage NoSQL Hadoop HDFS MapReduce Architecture 2 Big Data In information
More informationDept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,
More informationExamTorrent. Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you
ExamTorrent http://www.examtorrent.com Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you Exam : Apache-Hadoop-Developer Title : Hadoop 2.0 Certification exam for Pig
More informationVendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo
Vendor: Cloudera Exam Code: CCD-410 Exam Name: Cloudera Certified Developer for Apache Hadoop Version: Demo QUESTION 1 When is the earliest point at which the reduce method of a given Reducer can be called?
More informationVendor: Hortonworks. Exam Code: HDPCD. Exam Name: Hortonworks Data Platform Certified Developer. Version: Demo
Vendor: Hortonworks Exam Code: HDPCD Exam Name: Hortonworks Data Platform Certified Developer Version: Demo QUESTION 1 Workflows expressed in Oozie can contain: A. Sequences of MapReduce and Pig. These
More informationCCA-410. Cloudera. Cloudera Certified Administrator for Apache Hadoop (CCAH)
Cloudera CCA-410 Cloudera Certified Administrator for Apache Hadoop (CCAH) Download Full Version : http://killexams.com/pass4sure/exam-detail/cca-410 Reference: CONFIGURATION PARAMETERS DFS.BLOCK.SIZE
More informationCSE6331: Cloud Computing
CSE6331: Cloud Computing Leonidas Fegaras University of Texas at Arlington c 2017 by Leonidas Fegaras Map-Reduce Fundamentals Based on: J. Simeon: Introduction to MapReduce P. Michiardi: Tutorial on MapReduce
More informationMI-PDB, MIE-PDB: Advanced Database Systems
MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:
More information50 Must Read Hadoop Interview Questions & Answers
50 Must Read Hadoop Interview Questions & Answers Whizlabs Dec 29th, 2017 Big Data Are you planning to land a job with big data and data analytics? Are you worried about cracking the Hadoop job interview?
More informationLaarge-Scale Data Engineering
Laarge-Scale Data Engineering The MapReduce Framework & Hadoop Key premise: divide and conquer work partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 result combine Parallelisation challenges How
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationIntroduction to MapReduce. Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng.
Introduction to MapReduce Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng. Before MapReduce Large scale data processing was difficult! Managing hundreds or thousands of processors Managing parallelization
More informationChapter 5. The MapReduce Programming Model and Implementation
Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing
More informationActual4Dumps. Provide you with the latest actual exam dumps, and help you succeed
Actual4Dumps http://www.actual4dumps.com Provide you with the latest actual exam dumps, and help you succeed Exam : HDPCD Title : Hortonworks Data Platform Certified Developer Vendor : Hortonworks Version
More informationImporting and Exporting Data Between Hadoop and MySQL
Importing and Exporting Data Between Hadoop and MySQL + 1 About me Sarah Sproehnle Former MySQL instructor Joined Cloudera in March 2010 sarah@cloudera.com 2 What is Hadoop? An open-source framework for
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationBig Data and Scripting map reduce in Hadoop
Big Data and Scripting map reduce in Hadoop 1, 2, connecting to last session set up a local map reduce distribution enable execution of map reduce implementations using local file system only all tasks
More informationCloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018
Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster
More informationSouth Asian Journal of Engineering and Technology Vol.2, No.50 (2016) 5 10
ISSN Number (online): 2454-9614 Weather Data Analytics using Hadoop Components like MapReduce, Pig and Hive Sireesha. M 1, Tirumala Rao. S. N 2 Department of CSE, Narasaraopeta Engineering College, Narasaraopet,
More informationThe amount of data increases every day Some numbers ( 2012):
1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect
More information2/26/2017. The amount of data increases every day Some numbers ( 2012):
The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationHadoop/MapReduce Computing Paradigm
Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications
More informationCloud Computing CS
Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part
More informationInforma)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies
Informa)on Retrieval and Map- Reduce Implementa)ons Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies mas4108@louisiana.edu Map-Reduce: Why? Need to process 100TB datasets On 1 node:
More informationDatabase Applications (15-415)
Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April
More informationOverview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.
MapReduce and HDFS This presentation includes course content University of Washington Redistributed under the Creative Commons Attribution 3.0 license. All other contents: Overview Why MapReduce? What
More informationBig Data landscape Lecture #2
Big Data landscape Lecture #2 Contents 1 1 CORE Technologies 2 3 MapReduce YARN 4 SparK 5 Cassandra Contents 2 16 HBase 72 83 Accumulo memcached 94 Blur 10 5 Sqoop/Flume Contents 3 111 MongoDB 12 2 13
More informationHadoop. Introduction to BIGDATA and HADOOP
Hadoop Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big Data and Hadoop What is the need of going ahead with Hadoop? Scenarios to apt Hadoop Technology in REAL
More informationInria, Rennes Bretagne Atlantique Research Center
Hadoop TP 1 Shadi Ibrahim Inria, Rennes Bretagne Atlantique Research Center Getting started with Hadoop Prerequisites Basic Configuration Starting Hadoop Verifying cluster operation Hadoop INRIA S.IBRAHIM
More informationCloudera Exam CCA-410 Cloudera Certified Administrator for Apache Hadoop (CCAH) Version: 7.5 [ Total Questions: 97 ]
s@lm@n Cloudera Exam CCA-410 Cloudera Certified Administrator for Apache Hadoop (CCAH) Version: 7.5 [ Total Questions: 97 ] Question No : 1 Which two updates occur when a client application opens a stream
More informationParallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018
Parallel Processing - MapReduce and FlumeJava Amir H. Payberah payberah@kth.se 14/09/2018 The Course Web Page https://id2221kth.github.io 1 / 83 Where Are We? 2 / 83 What do we do when there is too much
More informationBig Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours
Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals
More informationHadoop MapReduce Framework
Hadoop MapReduce Framework Contents Hadoop MapReduce Framework Architecture Interaction Diagram of MapReduce Framework (Hadoop 1.0) Interaction Diagram of MapReduce Framework (Hadoop 2.0) Hadoop MapReduce
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (1/2) January 10, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationData Partitioning and MapReduce
Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,
More informationIntroducing Hadoop. This chapter covers
1 Introducing Hadoop This chapter covers The basics of writing a scalable, distributed data-intensive program Understanding Hadoop and MapReduce Writing and running a basic MapReduce program Today, we
More informationIN ACTION. Chuck Lam SAMPLE CHAPTER MANNING
IN ACTION Chuck Lam SAMPLE CHAPTER MANNING Hadoop in Action by Chuck Lam Chapter 1 Copyright 2010 Manning Publications brief contents PART I HADOOP A DISTRIBUTED PROGRAMMING FRAMEWORK... 1 1 Introducing
More informationMapReduce. Stony Brook University CSE545, Fall 2016
MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining
More informationMapReduce Simplified Data Processing on Large Clusters
MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /
More informationDistributed Systems 16. Distributed File Systems II
Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS
More informationSTATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns
STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big
More informationCloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L12: Cloud Computing 1
Cloud Computing Leonidas Fegaras University of Texas at Arlington Web Data Management and XML L12: Cloud Computing 1 Computing as a Utility Cloud computing is a model for enabling convenient, on-demand
More informationA BigData Tour HDFS, Ceph and MapReduce
A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!
More informationDistributed File Systems II
Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation
More informationSession 1 Big Data and Hadoop - Overview. - Dr. M. R. Sanghavi
Session 1 Big Data and Hadoop - Overview - Dr. M. R. Sanghavi Acknowledgement Prof. Kainjan M. Sanghavi For preparing this prsentation This presentation is available on my blog https://maheshsanghavi.wordpress.com/expert-talk-fdp-workshop/
More informationIntroduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)
Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction
More informationMap Reduce. Yerevan.
Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate
More informationClustering Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve
More informationClustering Documents. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve
More informationCopyright 2012, Oracle and/or its affiliates. All rights reserved.
1 Big Data Connectors: High Performance Integration for Hadoop and Oracle Database Melli Annamalai Sue Mavris Rob Abbott 2 Program Agenda Big Data Connectors: Brief Overview Connecting Hadoop with Oracle
More informationThe MapReduce Framework
The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce
More informationImproving the MapReduce Big Data Processing Framework
Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM
More informationBig Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2
Big Data Analysis using Hadoop Map-Reduce An Introduction Lecture 2 Last Week - Recap 1 In this class Examine the Map-Reduce Framework What work each of the MR stages does Mapper Shuffle and Sort Reducer
More informationTutorial Outline. Map/Reduce. Data Center as a computer [Patterson, cacm 2008] Acknowledgements
Tutorial Outline Map/Reduce Map/reduce Hadoop Sharma Chakravarthy Information Technology Laboratory Computer Science and Engineering Department The University of Texas at Arlington, Arlington, TX 76009
More informationHadoop-PR Hortonworks Certified Apache Hadoop 2.0 Developer (Pig and Hive Developer)
Hortonworks Hadoop-PR000007 Hortonworks Certified Apache Hadoop 2.0 Developer (Pig and Hive Developer) http://killexams.com/pass4sure/exam-detail/hadoop-pr000007 QUESTION: 99 Which one of the following
More informationTI2736-B Big Data Processing. Claudia Hauff
TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Pattern Hadoop Mix Graphs Giraph Spark Zoo Keeper Spark But first Partitioner & Combiner
More informationThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,
More informationDistributed Filesystem
Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the
More informationCloud Computing. Leonidas Fegaras University of Texas at Arlington. Web Data Management and XML L3b: Cloud Computing 1
Cloud Computing Leonidas Fegaras University of Texas at Arlington Web Data Management and XML L3b: Cloud Computing 1 Computing as a Utility Cloud computing is a model for enabling convenient, on-demand
More informationDistributed Computations MapReduce. adapted from Jeff Dean s slides
Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)
More informationKillTest *KIJGT 3WCNKV[ $GVVGT 5GTXKEG Q&A NZZV ]]] QORRZKYZ IUS =K ULLKX LXKK [VJGZK YKX\OIK LUX UTK _KGX
KillTest Q&A Exam : CCD-410 Title : Cloudera Certified Developer for Apache Hadoop (CCDH) Version : DEMO 1 / 4 1.When is the earliest point at which the reduce method of a given Reducer can be called?
More information