Big Data Analytics CP3620


1 Big Data Analytics CP3620

2 Big Data Some facts: 2.7 Zettabytes (2.7 billion TB) of data exists in the digital universe and it's growing. Facebook stores, accesses, and analyzes 30+ Petabytes (1 PB = 1000 TB) of user-generated data. Walmart handles more than 1 million customer transactions every hour.

3 Big Data More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide. Decoding the human genome originally took 10 years to process; now it can be achieved in one week. In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day.

4 Big Data Most of the data generated (text messages, web sites, tweets, music, videos etc.) is not well structured in terms of fields, types etc. Lack of structure makes it difficult to analyze. Therefore - Big Data Analytics

5 Big Data [chart: data volume growth over time] Structured data: relational. Unstructured data: text, audio, video, images.

6 Importance In 1998, Merrill Lynch cited a rule of thumb: Somewhere around 80-90% of all potentially usable business information may originate in unstructured form.

7 Big Data The term has been in use since the 1990s. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.

8 Big Data In 2012, the Gartner Group updated its definition as follows: "Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation."

9 Big Data Characteristics: Volume The quantity of generated and stored data. The size of the data determines the value and potential insight. Variety The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.

10 Big Data Velocity The speed at which the data is generated and processed. Variability Inconsistency of the data set can hamper processes to handle and manage it. Veracity The data quality of captured data can vary greatly, affecting the accurate analysis.

11 Use of Big Data Demand, supply and planning; risk management, fraud detection and trades; brand perception and customer experience; genome mapping and diagnostics.

12 Architecture Big data repositories have existed in many forms, often built by corporations with a special need. Teradata Corporation in 1984 marketed the parallel processing DBC 1012 system.

13 Architecture In 2000, Seisint Inc. (now LexisNexis Group) developed a C++ based distributed file-sharing framework for data storage and query. In 2004, Google published a paper on a process called MapReduce that uses a similar architecture. An implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop.

14 Some of the companies using Hadoop

15 Hadoop Hadoop is a distributed framework that can be used to process large data sets that reside in clusters of computers. Because it is a framework, Hadoop is not a single technology or product. Instead, Hadoop is made up of four core modules.

16 Hadoop Architecture [diagram]: MapReduce framework, YARN infrastructure, HDFS federation and Hadoop Common, running on top of the Cluster.

17 Hadoop Architecture 1. The Cluster is the set of host machines (nodes). This is the hardware part of the infrastructure. 2. Hadoop YARN (Yet Another Resource Negotiator) provides the framework to schedule jobs and manage resources across the cluster that holds the data.

18 Hadoop Architecture Two important YARN elements are: A. The Resource Manager (one per cluster) is the master. It knows where the slaves are located and how many resources they have. [diagram: the Resource Manager contains a Resource Scheduler, an Application Master liveness monitor, a Node Manager liveness monitor and several event handlers]

19 Hadoop Architecture B. The Node Manager (many per cluster) is the slave. A Container is a fraction of the NM capacity and is used by the client for running a program. [diagram: a Node Manager hosting Container #1 through Container #4]

20 Hadoop Architecture 3. Hadoop Distributed File System (HDFS) Provides access to application data. 4. Hadoop MapReduce A YARN-based parallel processing system for large data sets. 5. Hadoop Common A set of utilities that supports the three other core modules.

21 How Hadoop works [diagram]: the big data input is split into blocks (Block 1, Block 2, Block 3); each block is stored in the computer cluster and processed by a Map task, and the Reduce step combines the map outputs into the final Result.

22 HDFS Hadoop, by its very nature, is designed to run on multiple nodes, although it can be run on a single node. It runs on multiple nodes by implementing the Hadoop Distributed File System, or HDFS, which is a file system distributed across multiple nodes.

23 HDFS HDFS works by organizing the nodes into a system: one node is designated as the name node, and the rest become data nodes. Files are not stored on any one node; they are split into chunks of 64 MB and distributed across multiple data nodes.

24 MapReduce A core component of how Hadoop works is MapReduce. MapReduce is a programming framework that enables you to run computation across the large amounts of data on the nodes. It's very simple at its core: you need only write two methods, map and reduce, and after that you define a job configuration (JobConfig).

25 MapReduce The JobConfig is what defines the essential communication between your functions and the HDFS. While there are other parts of MapReduce, an application can be run with just these three components.
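As a rough sketch of what that job configuration looks like (using the newer org.apache.hadoop.mapreduce.Job API; the full WordCount driver appears later in these slides, and MyDriver/MyMapper/MyReducer are placeholder names):
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "my job");
job.setJarByClass(MyDriver.class);                        // jar that contains the job classes
job.setMapperClass(MyMapper.class);                       // your map method
job.setReducerClass(MyReducer.class);                     // your reduce method
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));     // input in HDFS
FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output in HDFS
System.exit(job.waitForCompletion(true) ? 0 : 1);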

26 MapReduce Typically, one JobTracker and one TaskTracker monitor every MapReduce job. The JobTracker's function is to send work to available TaskTrackers in the cluster, and the TaskTracker monitors the tasks being performed.

27 Single-node cluster [diagram]: a Hadoop client talks to a single cluster machine that runs both the Map/Reduce agent and the HDFS name node.

28 Hadoop Installation Setup Java environment 1. Download jdk-8u151-linux-i586.tar.gz from the Oracle JDK 8 downloads page. 2. Move the *.gz file from Downloads to /opt, unzip, untar and add /opt/jdk1.8.0_151/bin to the PATH variable (system wide, in /etc/environment).

29 Hadoop Installation 3. Set the JAVA_HOME environment variable in /etc/environment: JAVA_HOME=/opt/jdk1.8.0_151 Run source /etc/environment for the change to take effect without logging out.

30 Hadoop Installation Add dedicated group and user 1. Create hadoop group: sudo addgroup hadoop 2. Add hduser to hadoop group: sudo adduser --ingroup hadoop hduser (Set password, but leave name, office etc. empty)

31 Hadoop Installation Setup SSH (Hadoop interacts with its nodes via SSH) 1. Install Secure Shell: sudo apt-get install ssh which ssh should return /usr/bin/ssh which sshd should return /usr/sbin/sshd

32 Hadoop Installation 2. Create and Setup SSH Certificates: su hduser Password: ssh-keygen -t rsa -P "" When prompted for a filename press Enter. The identification and public key will be saved in the .ssh folder as id_rsa and id_rsa.pub. To allow passwordless login, the public key is then typically appended to ~/.ssh/authorized_keys (cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys).

33 Hadoop Installation Install Hadoop 1. Go to hadoop.apache.org and download the binary release. Become root, move the hadoop-2.6.5.tar.gz into the /usr/local directory and unzip. From /usr/local change the owner: sudo chown -R hduser:hadoop hadoop-2.6.5

34 Hadoop Installation Setup Configuration Files 1. As hduser, open ~/.bashrc and append the definition of the Hadoop variables to the end of that file:
#HADOOP VARIABLES START
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin

35 Hadoop Installation
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
Run source ~/.bashrc for the change to take effect.

36 Hadoop Installation 2. Hadoop environment variables are defined in /usr/local/hadoop-2.6.5/etc/hadoop/hadoop-env.sh. Make sure that JAVA_HOME is set as in: export JAVA_HOME=${JAVA_HOME}

37 Hadoop Installation 3. Configuration properties are defined in /usr/local/hadoop-2.6.5/etc/hadoop/core-site.xml. Create the hadoop temporary directory as root and change its ownership:
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp

38 Hadoop Installation Open core-site.xml as hduser and add:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>

39 Hadoop Installation
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.</description>
  </property>
</configuration>

40 Hadoop Installation 4. Setup MapReduce configuration file. Folder /usr/local/hadoop/etc/hadoop/ contains mapred-site.xml.template file which has to be renamed to mapred-site.xml

41 Hadoop Installation Open mapred-site.xml as hduser and add:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
</configuration>

42 Hadoop Installation 5. Setup the HDFS configuration file located in /usr/local/hadoop/etc/hadoop/hdfs-site.xml. Create two directories as root and change ownership:
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
sudo chown -R hduser:hadoop /usr/local/hadoop_store

43 Hadoop Installation Open hdfs-site.xml as hduser and add:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>

44 Hadoop Installation
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>

45 Hadoop Installation 6. Format the new Hadoop Filesystem: hadoop namenode -format Note that the format command should be executed once before we start using Hadoop. If this command is executed again after Hadoop has been used, it'll destroy all the data on the Hadoop file system.

46 Hadoop Installation 7. Start hadoop (as hduser): /usr/local/hadoop-2.6.5/sbin/start-dfs.sh Check that it's running with jps:
4960 NameNode
5458 SecondaryNameNode
5599 Jps
5164 DataNode

47 Hadoop Installation 8. Stop hadoop (as hduser): /usr/local/hadoop-2.6.5/sbin/stop-dfs.sh 9. Hadoop's web UI interfaces can then be opened in a browser (by default the NameNode UI listens on port 50070).

48 Hadoop Web Interfaces

49 Hadoop Web Interfaces

50 MapReduce preparation The underlying structure of the HDFS filesystem is very different from our normal file systems (64 MB blocks or larger). Additional considerations are: immutable outputs and actual coding.

51 MapReduce Job [table] Loading files: 64 MB or 128 MB blocks. File system: native file system, HDFS, cloud. Output: immutable, key-value pairs. You define: Input, Map, Reduce, Output (in Java).

52 Running MapReduce job Hadoop comes with several examples. They are located in the /usr/local/hadoop-2.6.5/share/hadoop/mapreduce directory. The following example calculates the value of PI:
hadoop jar /usr/local/hadoop-2.6.5/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar pi 2 5

53 Running MapReduce job The job reports the elapsed time ("Job finished in ... seconds") and the estimated value of Pi. We can vary the number of map tasks (instead of 2 try 16) and the number of samples per map (instead of 5 try 1000): the job takes longer but produces a more accurate estimate of Pi.

54 HDFS Features It is suitable for distributed storage and processing. Hadoop provides a command interface to interact with HDFS.

55 HDFS Features The built-in servers of the namenode and datanode help users to easily check the status of the cluster. Streaming access to file system data. HDFS provides file permissions and authentication.

56 HDFS Architecture [diagram]: the client issues metadata operations (e.g. for /home/hduser/data) to the Name Node and block operations (read, write) to the Data Nodes; blocks are replicated across Data Nodes in Rack 1 and Rack 2.

57 Namenode It acts as the master server (commodity hardware, GNU/Linux OS, namenode software) and performs the following tasks: manages the filesystem namespace; regulates clients' access to files; executes file system operations, i.e. runs the JobTracker which allocates jobs to TaskTrackers running on DataNodes.

58 Datanode It acts as the slave server (commodity hardware, GNU/Linux OS, datanode software) and performs the following tasks: read-write operations on the file system as per the client's request; block creation, deletion and replication as per the namenode's request; it also runs a TaskTracker.

59 Blocks User data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and stored in individual data nodes. These file segments are called blocks (default size is 64MB)

60 HDFS shell commands Create directory hadoop fs -mkdir <paths> Example: hadoop fs -mkdir /user/hduser/bible-output List the contents of a directory hadoop fs -ls <args> Example: hadoop fs -ls /user/hduser/bible-output

61 HDFS shell commands Copy file from/to local FS to HDFS
hadoop fs -copyFromLocal <local dir> <HDFS dir>
Example:
hadoop fs -copyFromLocal /home/branko/Downloads/bible.txt /user/hduser
hadoop fs -copyToLocal /user/hduser/bible-output /home/hduser/bible-output

62 HDFS shell commands See contents of a file
hadoop fs -cat <path[filename]>
Example: hadoop fs -cat /user/hduser/bible-output/part-r-00000
Move file from source to destination
hadoop fs -mv <src> <dest>
Example: hadoop fs -mv /user/hduser/dir1/abc.txt /user/hduser/dir2

63 HDFS shell commands Remove a file or directory hadoop fs -rm <arg> Example: hadoop fs -rm /user/hduser/dir1/abc.txt Recursive version of remove hadoop fs -rmr <arg> Example: hadoop fs -rmr /user/hduser/dir1

64 wordcount MapReduce job WordCount is a simple application that counts the number of occurrences of each word in a given input set. First of all, copy bible.txt into /user/hduser:
hadoop fs -copyFromLocal /home/branko/Downloads/bible.txt /user/hduser

65 wordcount MapReduce job If you run: hadoop fs -ls /user/hduser/ You should see something like:
Found 1 items
-rw-r--r-- 1 hduser supergroup ... /user/hduser/bible.txt

66 wordcount MapReduce job Run the mapreduce job as:
hadoop jar /usr/local/hadoop-2.6.5/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar wordcount /user/hduser/bible.txt /user/hduser/bible-output
Observe that the directory /user/hduser/bible-output is created.

67 wordcount MapReduce job Copy the output to the local fs:
hadoop fs -copyToLocal /user/hduser/bible-output /home/hduser/bible-output
The MapReduce job output is in the file part-r-00000. To find the 500 most frequent words, run:
cat part-r-00000 | sort -n -k2 -r | head -n500 > Top-500.txt

68 Mapreduce MapReduce is a processing technique for Hadoop's distributed computing. The MapReduce algorithm contains two important tasks: Map and Reduce.

69 Mapreduce Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Reduce takes the output from a map as an input and combines those data tuples into a smaller set of tuples.

70 Inputs and Outputs (Java) The MapReduce framework operates exclusively on <key,value> pairs. Both the input to the job and its output are sets of <key,value> pairs, conceivably of different types.

71 Inputs and Outputs (Java) The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
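For illustration only, a minimal custom key type implementing WritableComparable could look like the sketch below (the built-in Text and IntWritable classes used in WordCount already implement these interfaces, so this is not needed for the examples that follow):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearKey implements WritableComparable<YearKey> {  // hypothetical key type
  private int year;
  public YearKey() { }                                         // no-arg constructor required by the framework
  public YearKey(int year) { this.year = year; }
  public void write(DataOutput out) throws IOException {       // serialization (Writable)
    out.writeInt(year);
  }
  public void readFields(DataInput in) throws IOException {    // deserialization (Writable)
    year = in.readInt();
  }
  public int compareTo(YearKey other) {                        // sorting (WritableComparable)
    return Integer.compare(year, other.year);
  }
}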

72 Inputs and Outputs (Java)
Map: input <k1,v1> → output List(<k2,v2>)
Reduce: input <k2, List(v2)> → output List(<k3,v3>)

73 Mapreduce
Input: Hello Hadoop Goodbye Hadoop
Splitting (k1,v1): Hello Hadoop Goodbye Hadoop
Mapping List(k2,v2): (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)
Shuffling (k2, List(v2)): (Goodbye, (1)) (Hadoop, (1,1)) (Hello, (1))
Reducing: (Goodbye, 1) (Hadoop, 2) (Hello, 1)
Output List(k3,v3): (Goodbye, 1) (Hadoop, 2) (Hello, 1)

74 WordCount.java In order to compile the Java source, we need to set up the environment variable HADOOP_CLASSPATH in hduser's ~/.bashrc:
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

75 WordCount.java Compile with: hadoop com.sun.tools.javac.Main WordCount.java
Three classes are created:
-rw-r--r-- 1 hduser hadoop 1491 Jan 8 14:41 WordCount.class
-rw-r--r-- 1 hduser hadoop 1739 Jan 8 14:41 WordCount$IntSumReducer.class
-rw-r--r-- 1 hduser hadoop 2089 Jan 8 14:36 WordCount.java
-rw-r--r-- 1 hduser hadoop 1736 Jan 8 14:41 WordCount$TokenizerMapper.class

76 WordCount.java Create a jar: jar cf wc.jar WordCount*.class Create directory: hadoop fs -mkdir /user/hduser/input Create file test.txt (Hello Hadoop Goodbye Hadoop)

77 WordCount.java Copy the file to hdfs:
hadoop fs -copyFromLocal /home/hduser/test.txt /user/hduser/input
Run:
hadoop jar wc.jar WordCount /user/hduser/input /user/hduser/output

78 WordCount.java Copy the output to the local fs:
hadoop fs -copyToLocal /user/hduser/output /home/hduser/tmp/output
The output can be found in the part-r-00000 file:
Goodbye 1
Hadoop 2
Hello 1

79 WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

80 WordCount.java
public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

81 WordCount.java
  public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

82 WordCount.java
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

83 class Mapper Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks which transform input records into intermediate records.

84 class Mapper The framework then calls the map method for each line in the input. Individual words are extracted through StringTokenizer. In our example, map method implements the algorithm (in pseudo-code): foreach word w in line emit (word, 1)

85 class Mapper That is, the map method takes input text and tokenizes it into tuples (key, value), where the key is each individual word and the value is a constant (1).
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
  }
}

86 class Reducer Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> Reduces a set of intermediate values which share a key to a smaller set of values. Reducer has 3 primary phases:

87 class Reducer 1. Shuffle The Reducer copies the sorted output from each Mapper using HTTP across the network. 2. Sort The framework merge sorts Reducer inputs by keys. The shuffle and sort phases occur simultaneously.

88 class Reducer 3. Reduce In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via write(Object, Object).

89 class Reducer In our example, reduce method implements the algorithm (in pseudo-code): sum = 0 foreach v in values: sum = sum + v emit (word, sum)

90 class Reducer
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  result.set(sum);
  context.write(key, result);
}

91 main() method The following properties need to be set up in the main() method: 1. Output key class 2. Output value class 3. Mapper class 4. Reducer class 5. Input format 6. Output format 7. Input file path 8. Output file path

92 main() method
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}

93 WordCount - multiple files Input to a MapReduce job is not limited to a single file. Create another file (test1.txt containing: Hello World Goodbye World) and move it to /user/hduser:
hadoop fs -copyFromLocal /home/hduser/test1.txt /user/hduser/input
Remove the output directory and run again.

94 Tabular data Our next example is going to use a data set (the National Pollutant Release Inventory, CSV format, dataset id 22abff18-6f9d-4926-b7de-3a80c178bf95). We want to run a MapReduce job to find out which province released the largest amount of pollutants.

95 Tabular data Since Hadoop reads input one line at a time, we don't need a StringTokenizer. As soon as we get a new token (line), we need to split it into an array of Strings:
String temp = value.toString();
String[] air = temp.split(",");

96 Tabular data The string at index 6 is the province (key) and the string at index 19 is the total release of pollutants in tonnes (value).
String a = air[6].replaceAll("\"", "");
String b = air[19].replaceAll("\"", "");

97 Regular Expressions public String[] split(String regex) This method splits the string into an array of strings around matches of the given regular expression. Regex or Regular Expression is a way to describe a set of strings based on common characteristics shared by each string in the set.
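For example (a small standalone illustration, not taken from the data set), splitting a simple comma-separated line with no quoted fields:
String line = "ON,Toronto,123.5";
String[] fields = line.split(",");   // ["ON", "Toronto", "123.5"]
System.out.println(fields[0]);       // prints: ON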

98 Regular Expressions For our data set, the simple regular expression "," is not going to work since the comma is not only used to separate fields, but also appears inside fields. For example, look at the address in the second line: Site 4, Box 1, RR1 We'll need a more complex regex to accept commas between fields, but ignore them within fields.

99 Regular Expressions Regular expressions define sets of strings that share the common pattern. They can be used for searching, extracting and modifying text. Basic regex constructs include character classes, quantifiers, boundaries and groupings.

100 Character Classes Character classes are used to define the content of the pattern, e.g. what should the pattern look for?
Symbol Description
.   Any character
\d  A digit [0-9]
\D  A non-digit [^0-9]
\s  A whitespace character [ \t\n\x0B\f\r]
\S  A non-whitespace character [^\s]
\w  A word character [a-zA-Z_0-9]
\W  A non-word character [^\w]

101 Character Classes Consider the following regular expression: String regex = "H[ea]llo"; The set of strings defined by the regex is {Hello, Hallo}
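A quick way to check this (using String.matches(), covered later in these slides):
String regex = "H[ea]llo";
System.out.println("Hello".matches(regex));  // true
System.out.println("Hallo".matches(regex));  // true
System.out.println("Hullo".matches(regex));  // false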

102 Quantifiers Quantifiers can be used to specify the length that part of a pattern should match. A quantifier will bind to the expression group to its immediate left.
Symbol Description
*      Match 0 or more times
+      Match 1 or more times
?      Match 1 or 0 times
{n}    Match exactly n times
{n,}   Match at least n times
{n,m}  Match at least n times but not more than m times

103 Quantifiers Consider the following regular expression: String regex = "Hello{2,4}"; The set of strings defined by the regex is {Helloo, Hellooo, Helloooo}

104 Boundaries A boundary could be the beginning of a string, the end of a string, the beginning of a word etc.
Symbol Description
^   The beginning of a line
$   The end of a line
\b  A word boundary
\B  A non-word boundary
\A  The beginning of input
\G  The end of the previous match
\z  The end of input

105 Boundaries The following code excerpt extracts words beginning with the letter 'l':
String text = "Mary had a little lamb";
Pattern pattern = Pattern.compile("\\bl\\w+\\b");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
  System.out.println(matcher.group());
}

106 Groups A group is a captured subsequence of characters which may be used later in the expression with a backreference.
Symbol Description
()  Defines a group
\N  Refers to matched group number N

107 Groups The following code excerpt extracts matched words from the specified group of words:
String input = "I have a cat, but I like my dog better.";
Pattern p = Pattern.compile("(mouse|cat|dog|wolf|bear|human)");
Matcher m = p.matcher(input);
while (m.find()) {
  System.out.println(m.group());
}
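A small backreference example (not from the slides): \1 refers back to whatever group 1 matched, so the pattern below finds a word that is immediately repeated:
String text = "the the quick brown fox";
Pattern p = Pattern.compile("\\b(\\w+) \\1\\b");  // group 1, then the same word again
Matcher m = p.matcher(text);
if (m.find()) {
  System.out.println(m.group(1));  // prints: the
}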

108 Pattern and Matcher classes Class java.util.regex.Pattern is a compiled representation of a regular expression. A regular expression, specified as a string, must first be compiled into an instance of this class.

109 Pattern and Matcher classes The resulting pattern can then be used to create a java.util.regex.Matcher object that can match arbitrary character sequences against the regular expression. A typical invocation sequence is thus:
Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches();

110 Pattern and Matcher classes Most commonly used Pattern methods: Pattern compile(String regex) Compiles the given regular expression into a pattern. String pattern() Returns the regular expression from which this pattern was compiled.

111 Pattern and Matcher classes String[] split(CharSequence input) Splits the given input sequence around matches of this pattern. Matcher matcher(CharSequence input) Creates a matcher that will match the given input against this pattern.

112 Pattern and Matcher classes Most commonly used Matcher methods: boolean matches() Attempts to match the entire region against the pattern. boolean find() Attempts to find the next subsequence of the input sequence that matches the pattern.

113 Pattern and Matcher classes int start() Returns the start index of the previous match. int end() Returns the offset after the last character matched. String group() Returns the input subsequence matched by the previous match.
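Putting start(), end() and group() together on the earlier "Mary had a little lamb" example (a small illustration; we simply print each match and its position):
Pattern p = Pattern.compile("\\bl\\w+\\b");
Matcher m = p.matcher("Mary had a little lamb");
while (m.find()) {
  // prints: little [11,17) and lamb [18,22)
  System.out.println(m.group() + " [" + m.start() + "," + m.end() + ")");
}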

114 Lookaround regex The sole purpose of regular expressions is to decide whether a string matches or contains a certain pattern. But sometimes we have the condition that this pattern is preceded or followed by another pattern. A lookaround regex comes in handy when we don't want these conditions to be part of the match.

115 Positive lookahead (?=) A(?=B) Find expression A where expression B follows. Example: find first occurrence of substring xyz that must be immediately followed by the same substring. String regex = "xyz(?=xyz)"; Pattern pattern = Pattern.compile(regex); String s = "abcxyzxyzabc"; Matcher matcher = pattern.matcher(s);

116 Negative lookahead (?!) A(?!B) Find expression A where expression B does not follow. Example: find first occurrence of substring xyz that must NOT be immediately followed by the same substring. String regex = "xyz(?!xyz)"; Pattern pattern = Pattern.compile(regex); String s = "abcxyzxyzabc"; Matcher matcher = pattern.matcher(s);

117 Positive lookbehind (?<=) (?<=B)A Find expression A where expression B precedes. Example: find first occurrence of substring xyz that must be immediately preceded by substring abc. String regex = "(?<=abc)xyz"; Pattern pattern = Pattern.compile(regex); String s = "abcxyzxyzabc"; Matcher matcher = pattern.matcher(s);

118 Negative lookbehind (?<!) (?<!B)A Find expression A where expression B does NOT precede. Example: find the first occurrence of substring xyz that must not be immediately preceded by substring abc.
String regex = "(?<!abc)xyz";
Pattern pattern = Pattern.compile(regex);
String s = "abcxyzxyzabc";
Matcher matcher = pattern.matcher(s);

119 Lookarounds combined Lookaround expressions can be combined. For example, the regex: String regex = "(?<=abc)xyz(?=xyz)"; will match the first occurrence of xyz in abcxyzxyzabc, since it must be preceded by abc and followed by xyz.
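A quick check of the combined lookaround (a small illustration; we simply print where find() succeeds):
String regex = "(?<=abc)xyz(?=xyz)";
Matcher m = Pattern.compile(regex).matcher("abcxyzxyzabc");
if (m.find()) {
  // prints: 3 xyz (the first xyz, which is preceded by abc and followed by xyz)
  System.out.println(m.start() + " " + m.group());
}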

120 Java String regex methods The Java String class has several regular expression methods too. boolean matches(String regex) Tells whether or not this string matches the given regular expression.
String s = "one two three two one";
boolean m = s.matches(".*two.*"); // true

121 Java String regex methods String[] split(String regex) Splits this string around matches of the given regular expression.
String s = "one two three two one";
String[] t = s.split("two");
// t[0] = "one ", t[1] = " three ", t[2] = " one"

122 Java String regex methods String replaceFirst(String regex, String replacement) Replaces the first substring of this string that matches the given regular expression with the given replacement.
String s = "one two three two one";
String t = s.replaceFirst("two", "2");
// t = "one 2 three two one"

123 Java String regex methods String replaceAll(String regex, String replacement) Replaces each substring of this string that matches the given regular expression with the given replacement.
String s = "one two three two one";
String t = s.replaceAll("two", "2");
// t = "one 2 three 2 one"

124 Comma separated sequence Back to our example, where we need to split the string "one","two","1,2,3" into the strings "one", "two" and "1,2,3". In the first attempt, one might be tempted to write:
String reg = ",";
String str = "\"one\",\"two\",\"1,2,3\"";
String[] rp = str.split(reg);

125 Comma separated sequence But this will result in 5 strings: "one", "two", "1, 2 and 3". If we change the regex to "\",\"", i.e. split if and only if the comma is preceded and succeeded by double quotes, we'll get: "one, two and 1,2,3"

126 Comma separated sequence We could replace all occurrences of double quotes by the empty string (rp[i].replaceAll("\"","")) and we're done. But, unfortunately, all csv formats are not the same. Add spaces around the commas and run again. The input string won't be split at all since there is no pattern match.

127 Comma separated sequence We could change the regex to "\" *, *\"" to include any number of spaces preceding and following a comma between double quotes. This would (sort of) do the job, but we would still not get the strings "one", "two" and "1,2,3". Instead, we would get: "one, two and 1,2,3"

128 Comma separated sequence Our original criterion (split input on comma, but ignore commas within double quotes) can be rephrased: Split on the comma only if that comma has zero, or an even number of quotes ahead of it.

129 Comma separated sequence "one", "two", "1,2,3" For this we need lookahead construct: " *, *(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

130 PolluterCount.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.commons.lang.math.NumberUtils;

131 PolluterCount.java
public class PolluterCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, DoubleWritable> {
    private DoubleWritable val = new DoubleWritable();
    private Text word = new Text();

132 PolluterCount.java
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      String[] air = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
      if (air.length > 1) {
        String a = air[6].replaceAll("\"", "");
        String b = air[19].replaceAll("\"", "");
        if (NumberUtils.isNumber(b)) {
          double d = Double.parseDouble(b);
          val.set(d);
          word.set(a);
          context.write(word, val);
        }
      }
    }
  }

133 PolluterCount.java
  public static class DoubleSumReducer extends Reducer<Text,DoubleWritable,Text,DoubleWritable> {
    private DoubleWritable result = new DoubleWritable();
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
      double sum = 0;
      for (DoubleWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

134 PolluterCount.java
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "polluter count");
    job.setJarByClass(PolluterCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(DoubleSumReducer.class);
    job.setReducerClass(DoubleSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

135 PolluterCount.java Compile: hadoop com.sun.tools.javac.Main PolluterCount.java
Create jar: jar cf pc.jar PolluterCount*.class
Create directory: hadoop fs -mkdir /user/hduser/pcinput
Copy file to hdfs: hadoop fs -copyFromLocal /home/hduser/air_data.csv /user/hduser/pcinput

136 PolluterCount.java Run: hadoop jar pc.jar PolluterCount /user/hduser/pcinput/air_data.csv /user/hduser/pcoutput
Copy the output to the local fs: hadoop fs -copyToLocal /user/hduser/pcoutput /home/hduser
Sort: cat part-r-00000 | sort -n -k2 -r > sorted.txt

137 PolluterCount.java The unsorted output (part-r-00000) lists the tonnes emitted per province (AB, BC, MB, NB, NL, NS, NT, NU, ON, PE, QC, SK); sorted.txt ranks the provinces by tonnes emitted: AB, ON, QC, BC, MB, SK, NB, NS, NL, PE, NT, NU.

138 Apache log file format (common access) Analysis of large Apache web server log files is one of many hadoop applications. Apache log entry might look like: [01/Jul/1995:00:00: ] "GET /history/apollo/ HTTP/1.0"

139 Apache log file format (common access) The format is: "%h %l %u %t \"%r\" %>s %b" where: 1. %h is the remote host (client IP) 2. %l is identity of the user determined by identd 3. %u is identity of the user determined by HTTP authentication

140 Apache log file format (common access) 4. %t is the time the request was received 5. \"%r\" is the request line from the client ("GET /history/apollo/ HTTP/1.0") 6. %>s is the status code sent from the server to the client (200, 404 etc.) 7. %b is the size of the response to the client (in bytes)

141 Apache log file format (common access) In order to parse these logs we need to define pattern: String apachepattern = "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+)"; See LogParser.java for details.
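A minimal sketch of how such a pattern could be used (the actual LogParser.java referenced above may differ; the log line below is only illustrative, and the group numbers follow the pattern as written):
String apachePattern = "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+)";
Pattern p = Pattern.compile(apachePattern);
String entry = "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] \"GET /history/apollo/ HTTP/1.0\" 200 6245";
Matcher m = p.matcher(entry);
if (m.find()) {
  System.out.println(m.group(1));  // remote host
  System.out.println(m.group(6));  // requested resource
  System.out.println(m.group(8));  // status code
}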

142 Hadoop Ecosystem [diagram]: Apache Oozie (workflow scheduler) sits on top; Hive (SQL), Pig Latin (scripting), Mahout (machine learning) and HBase run over MapReduce and HDFS; Flume (logs) and Sqoop (RDBMS) import/export non-structured and structured data.

143 Hadoop ecosystem MapReduce and HDFS are all we need to process large amounts of data. However, this processing is batch processing - prepare input, process it and analyze output. For real (or near real) time processing one needs HBase.

144 HBase HBase is a non-relational distributed database modelled after Google's Bigtable. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop. Unlike relational databases, HBase does not support SQL scripting - oops!

145 Hive That's where Hive comes in. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Example: SELECT MAX(col_name) AS label FROM table;

146 Pig Latin Don't like writing MapReduce jobs in Java? Try Pig Latin. It is a scripting language that abstracts the programming from the Java MapReduce idiom. Example: words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

147 Mahout How to provide recommendations (e.g. movies you might be interested in on Netflix)? Detect mail spam? Auto-organize new content? Apache Mahout uses Hadoop to produce free implementations of distributed machine learning algorithms.

148 Oozie Apache Oozie is a workflow scheduler system to manage Apache Hadoop jobs. It is integrated with the rest of the Hadoop stack, supporting Java MapReduce, Hive, Pig, Sqoop etc.

149 Flume Apache Flume is a distributed service for efficiently collecting, aggregating, and moving large amounts of log data.

150 Sqoop Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. Imports can also be used to populate tables in HBase.

151 HBase A major deficiency of a file system (HDFS) is that it does not support random read/write operations in real time. HBase is a non-relational (relationships between tables are not supported) database system modelled after Google's Bigtable.

152 HBase It is a distributed storage system on top of HDFS designed to scale to petabytes of data and thousands of machines. Physically, HBase is composed of three types of servers in a master-slave type of architecture: HMaster, RegionServers and ZooKeeper.

153 HBase RegionServers serve data for reads and writes. When accessing data, clients communicate with HBase RegionServers directly. Region assignment and DDL (create, delete tables) operations are handled by the HBase Master process. ZooKeeper, a coordination service used by HBase, maintains the live cluster state.

154 HBase [diagram]: the HMaster registers with ZooKeeper; the client looks up the master via ZooKeeper and reads/writes directly from/to the RegionServers; the HMaster assigns regions to RegionServer 1 and RegionServer 2.

155 HBase Data Model Applications store data in HBase tables. Tables are made of rows and columns. All columns in HBase belong to a particular column family. Table cells are versioned (time-stamped). A cell's content is an uninterpreted array of bytes.

156 HBase Data Model Table row keys are also byte arrays. Rows in HBase tables are sorted by row key. The sort is byte-ordered. All table accesses are via the table row key - its primary key.
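Besides serving MapReduce jobs, rows can be written and read directly by row key with the HBase Java client; a minimal sketch for the 1.x client API is shown below (it reuses the 'test' table and 'cf' column family created in the HBase shell section later, and is not part of the original slides):
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
    Table table = conn.getTable(TableName.valueOf("test"));
    Put put = new Put(Bytes.toBytes("row1"));                    // row key
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("a"),       // column family and qualifier
                  Bytes.toBytes("value1"));
    table.put(put);
    Result result = table.get(new Get(Bytes.toBytes("row1")));   // read back by row key
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("a"));
    System.out.println(Bytes.toString(value));                   // prints: value1
    table.close();
    conn.close();
  }
}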

157 Conceptual View The table below contains two column families named contents and anchor. In this example, anchor contains two columns (anchor:cnnsi.com, anchor:my.look.ca) and contents contains one column (contents:html).
RowKey       Time Stamp  CF1:contents          CF2:anchors
com.cnn.www  t9                                anchor:cnnsi.com="cnn"
com.cnn.www  t8                                anchor:my.look.ca="CNN.com"
com.cnn.www  t6          contents:html=<html>
com.cnn.www  t5          contents:html=<html>
com.cnn.www  t3          contents:html=<html>

158 Physical View Although at a conceptual level tables may be viewed as a sparse set of rows, physically they are stored on a per-column-family basis.
RowKey       Time Stamp  CF2:anchors
com.cnn.www  t9          anchor:cnnsi.com="cnn"
com.cnn.www  t8          anchor:my.look.ca="CNN.com"
RowKey       Time Stamp  CF1:contents
com.cnn.www  t6          contents:html=<html>
com.cnn.www  t5          contents:html=<html>
com.cnn.www  t3          contents:html=<html>

159 HBase Installation Download hbase-1.3.1-bin.tar.gz and unzip: tar -xvzf hbase-1.3.1-bin.tar.gz
Move the directory: sudo mv hbase-1.3.1 /usr/local/
From /usr/local, change ownership as root: chown -R hduser:hadoop hbase-1.3.1

160 HBase Installation As hduser open ~/.bashrc and add:
#HBASE VARIABLES START
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
#HBASE VARIABLES END
Run: source ~/.bashrc

161 HBase Installation Setup hbase-env.sh in /usr/local/hbase-1.3.1/conf Uncomment export JAVA_HOME and set it to: export JAVA_HOME=/opt/jdk1.8.0_151 Add:
export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
export HBASE_MANAGES_ZK=true

162 HBase Installation Setup hbase-site.xml in /usr/local/hbase-1.3.1/conf
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:54310/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>

163 HBase Installation
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2222</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hduser/zookeeper</value>
  </property>
</configuration>

164 HBase Installation Start HBase: start-hbase.sh See if started: jps
6098 HMaster
6498 Jps
6215 HRegionServer
4951 SecondaryNameNode
4456 NameNode
6040 HQuorumPeer
4667 DataNode

165 HBase Shell Start HBase shell: hbase shell hbase(main):001:0> Check status: hbase(main):001:0> status 1 active master, 0 backup masters, 1 servers, 0 dead, average load

166 HBase Shell Create table: hbase(main):003:0> create 'test', 'cf' 0 row(s) in seconds => Hbase::Table - test

167 HBase Shell List table info: hbase(main):004:0> list 'test' TABLE test 1 row(s) in seconds

168 HBase Shell Populate table: hbase(main):005:0> put 'test', 'row1', 'cf:a', 'value1' 0 row(s) in seconds hbase(main):006:0> put 'test', 'row2', 'cf:b', 'value2' 0 row(s) in seconds hbase(main):007:0> put 'test', 'row3', 'cf:c', 'value3' 0 row(s) in seconds

169 HBase Shell Scanning table for all data at once: hbase(main):008:0> scan 'test' ROW COLUMN+CELL row1 column=cf:a, timestamp= , value=value1 row2 column=cf:b, timestamp= , value=value2 row3 column=cf:c, timestamp= , value=value3 3 row(s) in seconds

170 HBase Shell Get a single row of data: hbase(main):009:0> get 'test', 'row1' COLUMN CELL cf:a timestamp= , value=value1 1 row(s) in seconds Disable HBase table: hbase(main):010:0> disable 'test' Enable HBase table: hbase(main):010:0> enable 'test'

171 HBase Shell Drop HBase table: hbase(main):009:0> drop 'test' The table must first be disabled. Exit HBase shell: hbase(main):010:0> quit Stop HBase: stop-hbase.sh

172 HBase as source/sink HBase tables can be used as the source/sink for MapReduce jobs. A source is defined as:
TableMapReduceUtil.initTableMapperJob(
  "sourcetable",    // input table
  scan,             // scan to control cf
  MyMapper.class,   // mapper class
  null,             // mapper output key
  null,             // mapper output value
  job);

173 HBase as source/sink A sink is defined as:
TableMapReduceUtil.initTableReducerJob(
  "targettable",    // output table
  MyReducer.class,  // reducer class
  job);

174 TableMapper Class TableMapper<KEYOUT,VALUEOUT> Where KEYOUT and VALUEOUT are types of the output key and the value respectively. Method map() is inherited from the superclass Mapper.

175 TableReducer Class TableReducer<KEYIN,VALUEIN,KEYOUT> Where KEYIN, VALUEIN and KEYOUT are types of the input/output keys and the input value respectively.

176 TableReducer While the input key and value as well as the output key can be anything handed in from the previous map phase, the output value must be either a Put or a Delete. Method reduce() is inherited from the superclass Reducer.

177 HBase as source/sink example Our next example is going to use HBase table test1 that stores dates and sales and populate table test2 with total sales for each day. Date format is YYYYMMDD#n, where n is the sale counter for particular date.

178 HBase as source/sink example [tables]: test1 (CF1:sales) holds one row per sale with row keys of the form YYYYMMDD#n and values 100, 110, 200 and 210; test2 (CF1:sum) holds one row per date with the daily sums 210 and 410.

179 HBaseCount.java Create tables: hbase(main):003:0> create 'test1', 'cf1' 0 row(s) in seconds => Hbase::Table - test1 hbase(main):004:0> create 'test2', 'cf1' 0 row(s) in seconds => Hbase::Table - test2

180 HBase Shell Populate table test1:
hbase(main):005:0> put 'test1',' #1','cf1:sales','100'
0 row(s) in seconds
hbase(main):006:0> put 'test1',' #2','cf1:sales','110'
0 row(s) in seconds
hbase(main):007:0> put 'test1',' #1','cf1:sales','200'
0 row(s) in seconds
hbase(main):008:0> put 'test1',' #2','cf1:sales','210'
0 row(s) in seconds

181 HBaseCount.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
// HBASE
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;

182 HBaseCount.java
public class HBaseCount {
  public static class HBaseMapper extends TableMapper<Text, IntWritable> {
    private Text text = new Text();
    public void map(ImmutableBytesWritable rowKey, Result columns, Context context) throws IOException, InterruptedException {
      String inKey = new String(rowKey.get());
      String oKey = inKey.split("#")[0];
      text.set(oKey);
      byte[] bSales = columns.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("sales"));
      String sSales = new String(bSales);
      Integer sales = new Integer(sSales);
      context.write(text, new IntWritable(sales));
    }
  }

183 HBaseCount.java
  public static class HBaseReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable sales : values) {
        Integer intSales = new Integer(sales.toString());
        sum += intSales;
      }
      Put insHBase = new Put(key.getBytes());
      insHBase.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
      context.write(null, insHBase);
    }
  }

184 HBaseCount.java
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf1"));
    Job job = Job.getInstance();
    job.setJarByClass(HBaseCount.class);
    TableMapReduceUtil.initTableMapperJob(
      "test1", scan, HBaseMapper.class, Text.class, IntWritable.class, job);
    TableMapReduceUtil.initTableReducerJob(
      "test2", HBaseReducer.class, job);
    job.waitForCompletion(true);
  }
}

185 HBaseCount.java Modify /home/hduser/.bashrc:
HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
#HBASE VARIABLES START
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HBASE_HOME/lib/*"
#HBASE VARIABLES END
export HADOOP_CLASSPATH

186 HBaseCount.java Refresh: source ~/.bashrc
Compile: javac -cp $(hbase classpath):$(hadoop classpath) HBaseCount.java
Create jar: jar cf hb.jar HBaseCount*.class
Run: hadoop jar hb.jar HBaseCount

187 HBaseCount.java Results in the hbase shell:
hbase(main):003:0> scan 'test2'
ROW COLUMN+CELL
 column=cf1:sum, timestamp= , value=\x00\x00\x00\xd2
 column=cf1:sum, timestamp= , value=\x00\x00\x01\x9a
2 row(s) in seconds
hbase(main):001:0> org.apache.hadoop.hbase.util.Bytes.toInt("\x00\x00\x00\xd2".to_java_bytes)
=> 210
hbase(main):004:0> org.apache.hadoop.hbase.util.Bytes.toInt("\x00\x00\x01\x9a".to_java_bytes)
=> 410


Aims. Background. This exercise aims to get you to:

Aims. Background. This exercise aims to get you to: Aims This exercise aims to get you to: Import data into HBase using bulk load Read MapReduce input from HBase and write MapReduce output to HBase Manage data using Hive Manage data using Pig Background

More information

Big Data Analytics: Insights and Innovations

Big Data Analytics: Insights and Innovations International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 6, Issue 10 (April 2013), PP. 60-65 Big Data Analytics: Insights and Innovations

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

Steps: First install hadoop (if not installed yet) by, https://sl6it.wordpress.com/2015/12/04/1-study-and-configure-hadoop-for-big-data/

Steps: First install hadoop (if not installed yet) by, https://sl6it.wordpress.com/2015/12/04/1-study-and-configure-hadoop-for-big-data/ SL-V BE IT EXP 7 Aim: Design and develop a distributed application to find the coolest/hottest year from the available weather data. Use weather data from the Internet and process it using MapReduce. Steps:

More information

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours) Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:

More information

Compile and Run WordCount via Command Line

Compile and Run WordCount via Command Line Aims This exercise aims to get you to: Compile, run, and debug MapReduce tasks via Command Line Compile, run, and debug MapReduce tasks via Eclipse One Tip on Hadoop File System Shell Following are the

More information

Apache Hadoop Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.

Apache Hadoop Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2. SDJ INFOSOFT PVT. LTD Apache Hadoop 2.6.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.x Table of Contents Topic Software Requirements

More information

Installing Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g.

Installing Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g. Big Data Computing Instructor: Prof. Irene Finocchi Master's Degree in Computer Science Academic Year 2013-2014, spring semester Installing Hadoop Emanuele Fusco (fusco@di.uniroma1.it) Prerequisites You

More information

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters

More information

Introduction to HDFS and MapReduce

Introduction to HDFS and MapReduce Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -

More information

Hortonworks HDPCD. Hortonworks Data Platform Certified Developer. Download Full Version :

Hortonworks HDPCD. Hortonworks Data Platform Certified Developer. Download Full Version : Hortonworks HDPCD Hortonworks Data Platform Certified Developer Download Full Version : https://killexams.com/pass4sure/exam-detail/hdpcd QUESTION: 97 You write MapReduce job to process 100 files in HDFS.

More information

Big Data Exercises. Fall 2017 Week 5 ETH Zurich. MapReduce

Big Data Exercises. Fall 2017 Week 5 ETH Zurich. MapReduce Big Data Exercises Fall 2017 Week 5 ETH Zurich MapReduce Reading: White, T. (2015). Hadoop: The Definitive Guide (4th ed.). O Reilly Media, Inc. [ETH library] (Chapters 2, 6, 7, 8: mandatory, Chapter 9:

More information

Attacking & Protecting Big Data Environments

Attacking & Protecting Big Data Environments Attacking & Protecting Big Data Environments Birk Kauer & Matthias Luft {bkauer, mluft}@ernw.de #WhoAreWe Birk Kauer - Security Researcher @ERNW - Mainly Exploit Developer Matthias Luft - Security Researcher

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. HCatalog

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. HCatalog About the Tutorial HCatalog is a table storage management tool for Hadoop that exposes the tabular data of Hive metastore to other Hadoop applications. It enables users with different data processing tools

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Large-scale Information Processing

Large-scale Information Processing Sommer 2013 Large-scale Information Processing Ulf Brefeld Knowledge Mining & Assessment brefeld@kma.informatik.tu-darmstadt.de Anecdotal evidence... I think there is a world market for about five computers,

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 2: Hadoop Nuts and Bolts Jimmy Lin University of Maryland Thursday, January 31, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

COSC 6397 Big Data Analytics. Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading. Edgar Gabriel Spring 2014.

COSC 6397 Big Data Analytics. Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading. Edgar Gabriel Spring 2014. COSC 6397 Big Data Analytics Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading Edgar Gabriel Spring 2014 Recap on HBase Column-Oriented data store NoSQL DB Data is stored in

More information

Vendor: Hortonworks. Exam Code: HDPCD. Exam Name: Hortonworks Data Platform Certified Developer. Version: Demo

Vendor: Hortonworks. Exam Code: HDPCD. Exam Name: Hortonworks Data Platform Certified Developer. Version: Demo Vendor: Hortonworks Exam Code: HDPCD Exam Name: Hortonworks Data Platform Certified Developer Version: Demo QUESTION 1 Workflows expressed in Oozie can contain: A. Sequences of MapReduce and Pig. These

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Installing Hadoop / Yarn, Hive 2.1.0, Scala , and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes. By: Nicholas Propes 2016

Installing Hadoop / Yarn, Hive 2.1.0, Scala , and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes. By: Nicholas Propes 2016 Installing Hadoop 2.7.3 / Yarn, Hive 2.1.0, Scala 2.11.8, and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes By: Nicholas Propes 2016 1 NOTES Please follow instructions PARTS in order because the results

More information

Hadoop Development Introduction

Hadoop Development Introduction Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand

More information

Hadoop Quickstart. Table of contents

Hadoop Quickstart. Table of contents Table of contents 1 Purpose...2 2 Pre-requisites...2 2.1 Supported Platforms... 2 2.2 Required Software... 2 2.3 Installing Software...2 3 Download...2 4 Prepare to Start the Hadoop Cluster...3 5 Standalone

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Getting Started with Hadoop

Getting Started with Hadoop Getting Started with Hadoop May 28, 2018 Michael Völske, Shahbaz Syed Web Technology & Information Systems Bauhaus-Universität Weimar 1 webis 2018 What is Hadoop Started in 2004 by Yahoo Open-Source implementation

More information

UNIT-IV HDFS. Ms. Selva Mary. G

UNIT-IV HDFS. Ms. Selva Mary. G UNIT-IV HDFS HDFS ARCHITECTURE Dataset partition across a number of separate machines Hadoop Distributed File system The Design of HDFS HDFS is a file system designed for storing very large files with

More information

KillTest *KIJGT 3WCNKV[ $GVVGT 5GTXKEG Q&A NZZV ]]] QORRZKYZ IUS =K ULLKX LXKK [VJGZK YKX\OIK LUX UTK _KGX

KillTest *KIJGT 3WCNKV[ $GVVGT 5GTXKEG Q&A NZZV ]]] QORRZKYZ IUS =K ULLKX LXKK [VJGZK YKX\OIK LUX UTK _KGX KillTest Q&A Exam : CCD-410 Title : Cloudera Certified Developer for Apache Hadoop (CCDH) Version : DEMO 1 / 4 1.When is the earliest point at which the reduce method of a given Reducer can be called?

More information

CSE6331: Cloud Computing

CSE6331: Cloud Computing CSE6331: Cloud Computing Leonidas Fegaras University of Texas at Arlington c 2017 by Leonidas Fegaras Map-Reduce Fundamentals Based on: J. Simeon: Introduction to MapReduce P. Michiardi: Tutorial on MapReduce

More information

Hadoop is essentially an operating system for distributed processing. Its primary subsystems are HDFS and MapReduce (and Yarn).

Hadoop is essentially an operating system for distributed processing. Its primary subsystems are HDFS and MapReduce (and Yarn). 1 Hadoop Primer Hadoop is essentially an operating system for distributed processing. Its primary subsystems are HDFS and MapReduce (and Yarn). 2 Passwordless SSH Before setting up Hadoop, setup passwordless

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

Certified Big Data and Hadoop Course Curriculum

Certified Big Data and Hadoop Course Curriculum Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation

More information

Map Reduce. MCSN - N. Tonellotto - Distributed Enabling Platforms

Map Reduce. MCSN - N. Tonellotto - Distributed Enabling Platforms Map Reduce 1 MapReduce inside Google Googlers' hammer for 80% of our data crunching Large-scale web search indexing Clustering problems for Google News Produce reports for popular queries, e.g. Google

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Multi-Node Cluster Setup on Hadoop. Tushar B. Kute,

Multi-Node Cluster Setup on Hadoop. Tushar B. Kute, Multi-Node Cluster Setup on Hadoop Tushar B. Kute, http://tusharkute.com What is Multi-node? Multi-node cluster Multinode Hadoop cluster as composed of Master- Slave Architecture to accomplishment of BigData

More information

INTRODUCTION TO HADOOP

INTRODUCTION TO HADOOP Hadoop INTRODUCTION TO HADOOP Distributed Systems + Middleware: Hadoop 2 Data We live in a digital world that produces data at an impressive speed As of 2012, 2.7 ZB of data exist (1 ZB = 10 21 Bytes)

More information

Importing and Exporting Data Between Hadoop and MySQL

Importing and Exporting Data Between Hadoop and MySQL Importing and Exporting Data Between Hadoop and MySQL + 1 About me Sarah Sproehnle Former MySQL instructor Joined Cloudera in March 2010 sarah@cloudera.com 2 What is Hadoop? An open-source framework for

More information

CS November 2018

CS November 2018 Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Hadoop Cluster Implementation

Hadoop Cluster Implementation Hadoop Cluster Implementation By Aysha Binta Sayed ID:2013-1-60-068 Supervised By Dr. Md. Shamim Akhter Assistant Professor Department of Computer Science and Engineering East West University A project

More information

Innovatus Technologies

Innovatus Technologies HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String

More information

A Glimpse of the Hadoop Echosystem

A Glimpse of the Hadoop Echosystem A Glimpse of the Hadoop Echosystem 1 Hadoop Echosystem A cluster is shared among several users in an organization Different services HDFS and MapReduce provide the lower layers of the infrastructures Other

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Hadoop & Big Data Analytics Complete Practical & Real-time Training

Hadoop & Big Data Analytics Complete Practical & Real-time Training An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

Introduction into Big Data analytics Lecture 3 Hadoop ecosystem. Janusz Szwabiński

Introduction into Big Data analytics Lecture 3 Hadoop ecosystem. Janusz Szwabiński Introduction into Big Data analytics Lecture 3 Hadoop ecosystem Janusz Szwabiński Outlook of today s talk Apache Hadoop Project Common use cases Getting started with Hadoop Single node cluster Further

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information

Hadoop Online Training

Hadoop Online Training Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

TI2736-B Big Data Processing. Claudia Hauff

TI2736-B Big Data Processing. Claudia Hauff TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Pattern Hadoop Mix Graphs Giraph Spark Zoo Keeper Spark But first Partitioner & Combiner

More information

HDFS Access Options, Applications

HDFS Access Options, Applications Hadoop Distributed File System (HDFS) access, APIs, applications HDFS Access Options, Applications Able to access/use HDFS via command line Know about available application programming interfaces Example

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

itpass4sure Helps you pass the actual test with valid and latest training material.

itpass4sure   Helps you pass the actual test with valid and latest training material. itpass4sure http://www.itpass4sure.com/ Helps you pass the actual test with valid and latest training material. Exam : CCD-410 Title : Cloudera Certified Developer for Apache Hadoop (CCDH) Vendor : Cloudera

More information

HBase: Overview. HBase is a distributed column-oriented data store built on top of HDFS

HBase: Overview. HBase is a distributed column-oriented data store built on top of HDFS HBase 1 HBase: Overview HBase is a distributed column-oriented data store built on top of HDFS HBase is an Apache open source project whose goal is to provide storage for the Hadoop Distributed Computing

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

Recommended Literature

Recommended Literature COSC 6397 Big Data Analytics Introduction to Map Reduce (I) Edgar Gabriel Spring 2017 Recommended Literature Original MapReduce paper by google http://research.google.com/archive/mapreduce-osdi04.pdf Fantastic

More information

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Implementing Algorithmic Skeletons over Hadoop

Implementing Algorithmic Skeletons over Hadoop Implementing Algorithmic Skeletons over Hadoop Dimitrios Mouzopoulos E H U N I V E R S I T Y T O H F R G E D I N B U Master of Science Computer Science School of Informatics University of Edinburgh 2011

More information