Big Data Analytics CP3620


1 Big Data Analytics CP3620

2 Big Data Some facts: 2.7 Zettabytes (2.7 billion TB) of data exists in the digital universe and it's growing. Facebook stores, accesses, and analyzes 30+ Petabytes (1 PB = 1000 TB) of user-generated data. Walmart handles more than 1 million customer transactions every hour.

3 Big Data More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide. Decoding the human genome originally took 10 years to process; now it can be achieved in one week. In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day.

4 Big Data Most of the data generated (text messages, web sites, tweets, music, videos etc.) is not well structured in terms of fields, types etc. Lack of structure makes it difficult to analyze. Therefore - Big Data Analytics

5 Big Data [chart: data volume growth over time] Structured data: relational. Unstructured data: text, audio, video, images.

6 Importance In 1998, Merrill Lynch cited a rule of thumb: Somewhere around 80-90% of all potentially usable business information may originate in unstructured form.

7 Big Data The term has been in use since the 1990s. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.

8 Big Data In 2012, the Gartner Group updated its definition as follows: "Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation."

9 Big Data Characteristics: Volume The quantity of generated and stored data. The size of the data determines the value and potential insight. Variety The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.

10 Big Data Velocity The speed at which the data is generated and processed. Variability Inconsistency of the data set can hamper processes to handle and manage it. Veracity The data quality of captured data can vary greatly, affecting the accurate analysis.

11 Use of Big Data Demand, supply and planning; risk management, fraud detection and trades; brand perception and customer experience; genome mapping and diagnostics.

12 Architecture Big data repositories have existed in many forms, often built by corporations with a special need. Teradata Corporation in 1984 marketed the parallel processing DBC 1012 system.

13 Architecture In 2000, Seisint Inc. (now LexisNexis Group) developed a C++ based distributed file-sharing framework for data storage and query. In 2004, Google published a paper on a process called MapReduce that uses a similar architecture. An implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop.

14 Some of the companies using Hadoop

15 Hadoop Hadoop is a distributed framework that can be used to process large data sets that reside in clusters of computers. Because it is a framework, Hadoop is not a single technology or product. Instead, Hadoop is made up of four core modules.

16 Hadoop Architecture [diagram]: MapReduce framework, YARN infrastructure, HDFS federation and Hadoop Common, running on top of the Cluster.

17 Hadoop Architecture 1. The Cluster is the set of host machines (nodes). This is the hardware part of the infrastructure. 2. Hadoop YARN (Yet Another Resource Negotiator) provides the framework to schedule jobs and manage resources across the cluster that holds the data.

18 Hadoop Architecture Two important YARN elements are: A. The Resource Manager (one per cluster) is the master. It knows where the slaves are located and how many resources they have. [diagram: the Resource Manager contains a Resource Scheduler, an Application Master liveness monitor, a Node Manager liveness monitor and several event handlers]

19 Hadoop Architecture B. The Node Manager (many per cluster) is the slave. A Container is a fraction of the NM capacity and is used by the client for running a program. [diagram: a Node Manager hosting Container #1 through Container #4]

20 Hadoop Architecture 3. Hadoop Distributed File System (HDFS) Provides access to application data. 4. Hadoop MapReduce A YARN-based parallel processing system for large data sets. 5. Hadoop Common A set of utilities that supports the three other core modules.

21 How Hadoop works [diagram]: the big data input is split into blocks (Block 1, Block 2, Block 3); each block is stored in the computer cluster and processed by a Map task, and the Reduce step combines the map outputs into the final Result.

22 HDFS Hadoop, by its very nature, is designed to run on multiple nodes, although it can be run on a single node. It runs on multiple nodes by implementing the Hadoop Distributed File System, or HDFS, which is a file system distributed across multiple nodes.

23 HDFS HDFS works by organizing the nodes into a system: one node is designated as the name node, and the rest become data nodes. Files are not stored on any one node; they are split into chunks of 64 MB and distributed across multiple data nodes.

24 MapReduce A core component of how Hadoop works is MapReduce. MapReduce is a programming framework that enables you to run computation across the large amounts of data on the nodes. It's very simple at its core: you need only write two methods, map and reduce, and after that you define a job configuration (JobConfig).

25 MapReduce The JobConfig is what defines the essential communication between your functions and the HDFS. While there are other parts of MapReduce, an application can be run with just these three components.
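As a rough sketch of what that job configuration looks like (using the newer org.apache.hadoop.mapreduce.Job API; the full WordCount driver appears later in these slides, and MyDriver/MyMapper/MyReducer are placeholder names):
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "my job");
job.setJarByClass(MyDriver.class);                        // jar that contains the job classes
job.setMapperClass(MyMapper.class);                       // your map method
job.setReducerClass(MyReducer.class);                     // your reduce method
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));     // input in HDFS
FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output in HDFS
System.exit(job.waitForCompletion(true) ? 0 : 1);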

26 MapReduce Typically, one JobTracker and one TaskTracker monitor every MapReduce job. The JobTracker's function is to send work to available TaskTrackers in the cluster, and the TaskTracker monitors the tasks being performed.

27 Single-node cluster [diagram]: a Hadoop client talks to a single cluster machine that runs both the Map/Reduce agent and the HDFS name node.

28 Hadoop Installation Setup Java environment 1. Download jdk-8u151-linux-i586.tar.gz from the Oracle JDK 8 downloads page. 2. Move the *.gz file from Downloads to /opt, unzip, untar and add /opt/jdk1.8.0_151/bin to the PATH variable (system wide, in /etc/environment).

29 Hadoop Installation 3. Set the JAVA_HOME environment variable in /etc/environment: JAVA_HOME=/opt/jdk1.8.0_151 Run source /etc/environment for the change to take effect without logging out.

30 Hadoop Installation Add dedicated group and user 1. Create hadoop group: sudo addgroup hadoop 2. Add hduser to hadoop group: sudo adduser --ingroup hadoop hduser (Set password, but leave name, office etc. empty)

31 Hadoop Installation Setup SSH (Hadoop interacts with its nodes via SSH) 1. Install Secure Shell: sudo apt-get install ssh which ssh should return /usr/bin/ssh which sshd should return /usr/sbin/sshd

32 Hadoop Installation 2. Create and Setup SSH Certificates: su hduser Password: ssh-keygen -t rsa -P "" When prompted for a filename press Enter. The identification and public key will be saved in the .ssh folder as id_rsa and id_rsa.pub. To allow passwordless login, the public key is then typically appended to ~/.ssh/authorized_keys (cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys).

33 Hadoop Installation Install Hadoop 1. Go to hadoop.apache.org and download the binary release. Become root, move the hadoop-2.6.5.tar.gz into the /usr/local directory and unzip. From /usr/local change the owner: sudo chown -R hduser:hadoop hadoop-2.6.5

34 Hadoop Installation Setup Configuration Files 1. As hduser, open ~/.bashrc and append the definition of the Hadoop variables to the end of that file:
#HADOOP VARIABLES START
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin

35 Hadoop Installation
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
Run source ~/.bashrc for the change to take effect.

36 Hadoop Installation 2. Hadoop environment variables are defined in /usr/local/hadoop-2.6.5/etc/hadoop/hadoop-env.sh. Make sure that JAVA_HOME is set as in: export JAVA_HOME=${JAVA_HOME}

37 Hadoop Installation 3. Configuration properties are defined in /usr/local/hadoop-2.6.5/etc/hadoop/core-site.xml. Create the hadoop temporary directory as root and change its ownership:
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp

38 Hadoop Installation Open core-site.xml as hduser and add:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>

39 Hadoop Installation
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.</description>
  </property>
</configuration>

40 Hadoop Installation 4. Setup MapReduce configuration file. Folder /usr/local/hadoop/etc/hadoop/ contains mapred-site.xml.template file which has to be renamed to mapred-site.xml

41 Hadoop Installation Open mapred-site.xml as hduser and add:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
</configuration>

42 Hadoop Installation 5. Setup the HDFS configuration file located in /usr/local/hadoop/etc/hadoop/hdfs-site.xml. Create two directories as root and change ownership:
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
sudo chown -R hduser:hadoop /usr/local/hadoop_store

43 Hadoop Installation Open hdfs-site.xml as hduser and add:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>

44 Hadoop Installation
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>

45 Hadoop Installation 6. Format the new Hadoop Filesystem: hadoop namenode -format Note that the format command should be executed once before we start using Hadoop. If this command is executed again after Hadoop has been used, it'll destroy all the data on the Hadoop file system.

46 Hadoop Installation 7. Start hadoop (as hduser): /usr/local/hadoop-2.6.5/sbin/start-dfs.sh Check that it's running with jps:
4960 NameNode
5458 SecondaryNameNode
5599 Jps
5164 DataNode

47 Hadoop Installation 8. Stop hadoop (as hduser): /usr/local/hadoop-2.6.5/sbin/stop-dfs.sh 9. Hadoop's web UI interfaces can then be opened in a browser (by default the NameNode UI listens on port 50070).

48 Hadoop Web Interfaces

49 Hadoop Web Interfaces

50 MapReduce preparation The underlying structure of the HDFS filesystem is very different from our normal file systems (64 MB blocks or larger). Additional considerations are: immutable outputs and actual coding.

51 MapReduce Job [table] Loading files: 64 MB or 128 MB blocks. File system: native file system, HDFS, cloud. Output: immutable, key-value pairs. You define: Input, Map, Reduce, Output (in Java).

52 Running MapReduce job Hadoop comes with several examples. They are located in the /usr/local/hadoop-2.6.5/share/hadoop/mapreduce directory. The following example calculates the value of PI:
hadoop jar /usr/local/hadoop-2.6.5/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar pi 2 5

53 Running MapReduce job The job reports the elapsed time ("Job finished in ... seconds") and the estimated value of Pi. We can vary the number of map tasks (instead of 2 try 16) and the number of samples per map (instead of 5 try 1000): the job takes longer but produces a more accurate estimate of Pi.

54 HDFS Features It is suitable for distributed storage and processing. Hadoop provides a command interface to interact with HDFS.

55 HDFS Features The built-in servers of the namenode and datanode help users to easily check the status of the cluster. Streaming access to file system data. HDFS provides file permissions and authentication.

56 HDFS Architecture [diagram]: the client issues metadata operations (e.g. for /home/hduser/data) to the Name Node and block operations (read, write) to the Data Nodes; blocks are replicated across Data Nodes in Rack 1 and Rack 2.

57 Namenode It acts as the master server (commodity hardware, GNU/Linux OS, namenode software) and performs the following tasks: manages the filesystem namespace; regulates clients' access to files; executes file system operations, i.e. runs the JobTracker which allocates jobs to TaskTrackers running on DataNodes.

58 Datanode It acts as the slave server (commodity hardware, GNU/Linux OS, datanode software) and performs the following tasks: read-write operations on the file system as per the client's request; block creation, deletion and replication as per the namenode's request; it also runs a TaskTracker.

59 Blocks User data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and stored in individual data nodes. These file segments are called blocks (default size is 64MB)

60 HDFS shell commands Create directory hadoop fs -mkdir <paths> Example: hadoop fs -mkdir /user/hduser/bible-output List the contents of a directory hadoop fs -ls <args> Example: hadoop fs -ls /user/hduser/bible-output

61 HDFS shell commands Copy file from/to local FS to HDFS
hadoop fs -copyFromLocal <local dir> <HDFS dir>
Example:
hadoop fs -copyFromLocal /home/branko/Downloads/bible.txt /user/hduser
hadoop fs -copyToLocal /user/hduser/bible-output /home/hduser/bible-output

62 HDFS shell commands See contents of a file
hadoop fs -cat <path[filename]>
Example: hadoop fs -cat /user/hduser/bible-output/part-r-00000
Move file from source to destination
hadoop fs -mv <src> <dest>
Example: hadoop fs -mv /user/hduser/dir1/abc.txt /user/hduser/dir2

63 HDFS shell commands Remove a file or directory hadoop fs -rm <arg> Example: hadoop fs -rm /user/hduser/dir1/abc.txt Recursive version of remove hadoop fs -rmr <arg> Example: hadoop fs -rmr /user/hduser/dir1

64 wordcount MapReduce job WordCount is a simple application that counts the number of occurrences of each word in a given input set. First of all, copy bible.txt into /user/hduser:
hadoop fs -copyFromLocal /home/branko/Downloads/bible.txt /user/hduser

65 wordcount MapReduce job If you run: hadoop fs -ls /user/hduser/ You should see something like:
Found 1 items
-rw-r--r-- 1 hduser supergroup ... /user/hduser/bible.txt

66 wordcount MapReduce job Run the mapreduce job as:
hadoop jar /usr/local/hadoop-2.6.5/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar wordcount /user/hduser/bible.txt /user/hduser/bible-output
Observe that the directory /user/hduser/bible-output is created.

67 wordcount MapReduce job Copy the output to the local fs:
hadoop fs -copyToLocal /user/hduser/bible-output /home/hduser/bible-output
The MapReduce job output is in the file part-r-00000. To find the 500 most frequent words, run:
cat part-r-00000 | sort -n -k2 -r | head -n500 > Top-500.txt

68 Mapreduce MapReduce is a processing technique for Hadoop's distributed computing. The MapReduce algorithm contains two important tasks: Map and Reduce.

69 Mapreduce Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Reduce takes the output from a map as an input and combines those data tuples into a smaller set of tuples.

70 Inputs and Outputs (Java) The MapReduce framework operates exclusively on <key,value> pairs. Both the input to the job and its output are sets of <key,value> pairs, conceivably of different types.

71 Inputs and Outputs (Java) The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
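For illustration only, a minimal custom key type implementing WritableComparable could look like the sketch below (the built-in Text and IntWritable classes used in WordCount already implement these interfaces, so this is not needed for the examples that follow):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearKey implements WritableComparable<YearKey> {  // hypothetical key type
  private int year;
  public YearKey() { }                                         // no-arg constructor required by the framework
  public YearKey(int year) { this.year = year; }
  public void write(DataOutput out) throws IOException {       // serialization (Writable)
    out.writeInt(year);
  }
  public void readFields(DataInput in) throws IOException {    // deserialization (Writable)
    year = in.readInt();
  }
  public int compareTo(YearKey other) {                        // sorting (WritableComparable)
    return Integer.compare(year, other.year);
  }
}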

72 Inputs and Outputs (Java)
Map: input <k1,v1> → output List(<k2,v2>)
Reduce: input <k2, List(v2)> → output List(<k3,v3>)

73 Mapreduce
Input: Hello Hadoop Goodbye Hadoop
Splitting (k1,v1): Hello Hadoop Goodbye Hadoop
Mapping List(k2,v2): (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)
Shuffling (k2, List(v2)): (Goodbye, (1)) (Hadoop, (1,1)) (Hello, (1))
Reducing: (Goodbye, 1) (Hadoop, 2) (Hello, 1)
Output List(k3,v3): (Goodbye, 1) (Hadoop, 2) (Hello, 1)

74 WordCount.java In order to compile the Java source, we need to set up the environment variable HADOOP_CLASSPATH in hduser's ~/.bashrc:
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

75 WordCount.java Compile with: hadoop com.sun.tools.javac.Main WordCount.java
Three classes are created:
-rw-r--r-- 1 hduser hadoop 1491 Jan 8 14:41 WordCount.class
-rw-r--r-- 1 hduser hadoop 1739 Jan 8 14:41 WordCount$IntSumReducer.class
-rw-r--r-- 1 hduser hadoop 2089 Jan 8 14:36 WordCount.java
-rw-r--r-- 1 hduser hadoop 1736 Jan 8 14:41 WordCount$TokenizerMapper.class

76 WordCount.java Create a jar: jar cf wc.jar WordCount*.class Create directory: hadoop fs -mkdir /user/hduser/input Create file test.txt (Hello Hadoop Goodbye Hadoop)

77 WordCount.java Copy the file to hdfs:
hadoop fs -copyFromLocal /home/hduser/test.txt /user/hduser/input
Run:
hadoop jar wc.jar WordCount /user/hduser/input /user/hduser/output

78 WordCount.java Copy the output to the local fs:
hadoop fs -copyToLocal /user/hduser/output /home/hduser/tmp/output
The output can be found in the part-r-00000 file:
Goodbye 1
Hadoop 2
Hello 1

79 WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

80 WordCount.java
public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

81 WordCount.java
  public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

82 WordCount.java
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

83 class Mapper Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks which transform input records into intermediate records.

84 class Mapper The framework then calls the map method for each line in the input. Individual words are extracted through StringTokenizer. In our example, map method implements the algorithm (in pseudo-code): foreach word w in line emit (word, 1)

85 class Mapper That is, the map method takes input text and tokenizes it into tuples (key, value), where the key is each individual word and the value is a constant (1).
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
  }
}

86 class Reducer Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> Reduces a set of intermediate values which share a key to a smaller set of values. Reducer has 3 primary phases:

87 class Reducer 1. Shuffle The Reducer copies the sorted output from each Mapper using HTTP across the network. 2. Sort The framework merge sorts Reducer inputs by keys. The shuffle and sort phases occur simultaneously.

88 class Reducer 3. Reduce In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via write(Object, Object).

89 class Reducer In our example, reduce method implements the algorithm (in pseudo-code): sum = 0 foreach v in values: sum = sum + v emit (word, sum)

90 class Reducer
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  result.set(sum);
  context.write(key, result);
}

91 main() method The following properties need to be set up in the main() method: 1. Output key class 2. Output value class 3. Mapper class 4. Reducer class 5. Input format 6. Output format 7. Input file path 8. Output file path

92 main() method
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}

93 WordCount - multiple files Input to a MapReduce job is not limited to a single file. Create another file (test1.txt containing: Hello World Goodbye World) and move it to /user/hduser:
hadoop fs -copyFromLocal /home/hduser/test1.txt /user/hduser/input
Remove the output directory and run again.

94 Tabular data Our next example is going to use a data set (the National Pollutant Release Inventory, CSV format, dataset id 22abff18-6f9d-4926-b7de-3a80c178bf95). We want to run a MapReduce job to find out which province released the largest amount of pollutants.

95 Tabular data Since Hadoop reads input one line at a time, we don't need a StringTokenizer. As soon as we get a new token (line), we need to split it into an array of Strings:
String temp = value.toString();
String[] air = temp.split(",");

96 Tabular data The string at index 6 is the province (key) and the string at index 19 is the total release of pollutants in tonnes (value).
String a = air[6].replaceAll("\"", "");
String b = air[19].replaceAll("\"", "");

97 Regular Expressions public String[] split(String regex) This method splits the string into an array of strings around matches of the given regular expression. Regex or Regular Expression is a way to describe a set of strings based on common characteristics shared by each string in the set.
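For example (a small standalone illustration, not taken from the data set), splitting a simple comma-separated line with no quoted fields:
String line = "ON,Toronto,123.5";
String[] fields = line.split(",");   // ["ON", "Toronto", "123.5"]
System.out.println(fields[0]);       // prints: ON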

98 Regular Expressions For our data set, the simple regular expression "," is not going to work since the comma is not only used to separate fields, but also appears inside fields. For example, look at the address in the second line: Site 4, Box 1, RR1 We'll need a more complex regex to accept commas between fields, but ignore them within fields.

99 Regular Expressions Regular expressions define sets of strings that share the common pattern. They can be used for searching, extracting and modifying text. Basic regex constructs include character classes, quantifiers, boundaries and groupings.

100 Character Classes Character classes are used to define the content of the pattern, e.g. what should the pattern look for?
Symbol Description
.   Any character
\d  A digit [0-9]
\D  A non-digit [^0-9]
\s  A whitespace character [ \t\n\x0B\f\r]
\S  A non-whitespace character [^\s]
\w  A word character [a-zA-Z_0-9]
\W  A non-word character [^\w]

101 Character Classes Consider the following regular expression: String regex = "H[ea]llo"; The set of strings defined by the regex is {Hello, Hallo}
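A quick way to check this (using String.matches(), covered later in these slides):
String regex = "H[ea]llo";
System.out.println("Hello".matches(regex));  // true
System.out.println("Hallo".matches(regex));  // true
System.out.println("Hullo".matches(regex));  // false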

102 Quantifiers Quantifiers can be used to specify the length that part of a pattern should match. A quantifier will bind to the expression group to its immediate left.
Symbol Description
*      Match 0 or more times
+      Match 1 or more times
?      Match 1 or 0 times
{n}    Match exactly n times
{n,}   Match at least n times
{n,m}  Match at least n times but not more than m times

103 Quantifiers Consider the following regular expression: String regex = "Hello{2,4}"; The set of strings defined by the regex is {Helloo, Hellooo, Helloooo}

104 Boundaries A boundary could be the beginning of a string, the end of a string, the beginning of a word etc.
Symbol Description
^   The beginning of a line
$   The end of a line
\b  A word boundary
\B  A non-word boundary
\A  The beginning of input
\G  The end of the previous match
\z  The end of input

105 Boundaries The following code excerpt extracts words beginning with the letter 'l':
String text = "Mary had a little lamb";
Pattern pattern = Pattern.compile("\\bl\\w+\\b");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
  System.out.println(matcher.group());
}

106 Groups A group is a captured subsequence of characters which may be used later in the expression with a backreference.
Symbol Description
()  Defines a group
\N  Refers to matched group number N

107 Groups The following code excerpt extracts matched words from the specified group of words:
String input = "I have a cat, but I like my dog better.";
Pattern p = Pattern.compile("(mouse|cat|dog|wolf|bear|human)");
Matcher m = p.matcher(input);
while (m.find()) {
  System.out.println(m.group());
}
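A small backreference example (not from the slides): \1 refers back to whatever group 1 matched, so the pattern below finds a word that is immediately repeated:
String text = "the the quick brown fox";
Pattern p = Pattern.compile("\\b(\\w+) \\1\\b");  // group 1, then the same word again
Matcher m = p.matcher(text);
if (m.find()) {
  System.out.println(m.group(1));  // prints: the
}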

108 Pattern and Matcher classes Class java.util.regex.Pattern is a compiled representation of a regular expression. A regular expression, specified as a string, must first be compiled into an instance of this class.

109 Pattern and Matcher classes The resulting pattern can then be used to create a java.util.regex.Matcher object that can match arbitrary character sequences against the regular expression. A typical invocation sequence is thus:
Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches();

110 Pattern and Matcher classes Most commonly used Pattern methods: Pattern compile(String regex) Compiles the given regular expression into a pattern. String pattern() Returns the regular expression from which this pattern was compiled.

111 Pattern and Matcher classes String[] split(CharSequence input) Splits the given input sequence around matches of this pattern. Matcher matcher(CharSequence input) Creates a matcher that will match the given input against this pattern.

112 Pattern and Matcher classes Most commonly used Matcher methods: boolean matches() Attempts to match the entire region against the pattern. boolean find() Attempts to find the next subsequence of the input sequence that matches the pattern.

113 Pattern and Matcher classes int start() Returns the start index of the previous match. int end() Returns the offset after the last character matched. String group() Returns the input subsequence matched by the previous match.
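Putting start(), end() and group() together on the earlier "Mary had a little lamb" example (a small illustration; we simply print each match and its position):
Pattern p = Pattern.compile("\\bl\\w+\\b");
Matcher m = p.matcher("Mary had a little lamb");
while (m.find()) {
  // prints: little [11,17) and lamb [18,22)
  System.out.println(m.group() + " [" + m.start() + "," + m.end() + ")");
}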

114 Lookaround regex The sole purpose of regular expressions is to decide whether a string matches or contains a certain pattern. But sometimes we have the condition that this pattern is preceded or followed by another pattern. A lookaround regex comes in handy when we don't want these conditions to be part of the match.

115 Positive lookahead (?=) A(?=B) Find expression A where expression B follows. Example: find first occurrence of substring xyz that must be immediately followed by the same substring. String regex = "xyz(?=xyz)"; Pattern pattern = Pattern.compile(regex); String s = "abcxyzxyzabc"; Matcher matcher = pattern.matcher(s);

116 Negative lookahead (?!) A(?!B) Find expression A where expression B does not follow. Example: find first occurrence of substring xyz that must NOT be immediately followed by the same substring. String regex = "xyz(?!xyz)"; Pattern pattern = Pattern.compile(regex); String s = "abcxyzxyzabc"; Matcher matcher = pattern.matcher(s);

117 Positive lookbehind (?<=) (?<=B)A Find expression A where expression B precedes. Example: find first occurrence of substring xyz that must be immediately preceded by substring abc. String regex = "(?<=abc)xyz"; Pattern pattern = Pattern.compile(regex); String s = "abcxyzxyzabc"; Matcher matcher = pattern.matcher(s);

118 Negative lookbehind (?<!) (?<!B)A Find expression A where expression B does NOT precede. Example: find the first occurrence of substring xyz that must not be immediately preceded by substring abc.
String regex = "(?<!abc)xyz";
Pattern pattern = Pattern.compile(regex);
String s = "abcxyzxyzabc";
Matcher matcher = pattern.matcher(s);

119 Lookarounds combined Lookaround expressions can be combined. For example, the regex: String regex = "(?<=abc)xyz(?=xyz)"; will match the first occurrence of xyz in abcxyzxyzabc, since it must be preceded by abc and followed by xyz.
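A quick check of the combined lookaround (a small illustration; we simply print where find() succeeds):
String regex = "(?<=abc)xyz(?=xyz)";
Matcher m = Pattern.compile(regex).matcher("abcxyzxyzabc");
if (m.find()) {
  // prints: 3 xyz (the first xyz, which is preceded by abc and followed by xyz)
  System.out.println(m.start() + " " + m.group());
}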

120 Java String regex methods The Java String class has several regular expression methods too. boolean matches(String regex) Tells whether or not this string matches the given regular expression.
String s = "one two three two one";
boolean m = s.matches(".*two.*"); // true

121 Java String regex methods String[] split(String regex) Splits this string around matches of the given regular expression.
String s = "one two three two one";
String[] t = s.split("two");
// t[0] = "one ", t[1] = " three ", t[2] = " one"

122 Java String regex methods String replaceFirst(String regex, String replacement) Replaces the first substring of this string that matches the given regular expression with the given replacement.
String s = "one two three two one";
String t = s.replaceFirst("two", "2");
// t = "one 2 three two one"

123 Java String regex methods String replaceAll(String regex, String replacement) Replaces each substring of this string that matches the given regular expression with the given replacement.
String s = "one two three two one";
String t = s.replaceAll("two", "2");
// t = "one 2 three 2 one"

124 Comma separated sequence Back to our example, where we need to split the string "one","two","1,2,3" into the strings "one", "two" and "1,2,3". In the first attempt, one might be tempted to write:
String reg = ",";
String str = "\"one\",\"two\",\"1,2,3\"";
String[] rp = str.split(reg);

125 Comma separated sequence But this will result in 5 strings: "one", "two", "1, 2 and 3". If we change the regex to "\",\"", i.e. split if and only if the comma is preceded and succeeded by double quotes, we'll get: "one, two and 1,2,3"

126 Comma separated sequence We could replace all occurrences of double quotes by the empty string (rp[i].replaceAll("\"","")) and we're done. But, unfortunately, all csv formats are not the same. Add spaces around the commas and run again. The input string won't be split at all since there is no pattern match.

127 Comma separated sequence We could change the regex to "\" *, *\"" to include any number of spaces preceding and following a comma between double quotes. This would (sort of) do the job, but we would still not get the strings "one", "two" and "1,2,3". Instead, we would get: "one, two and 1,2,3"

128 Comma separated sequence Our original criterion (split input on comma, but ignore commas within double quotes) can be rephrased: Split on the comma only if that comma has zero, or an even number of quotes ahead of it.

129 Comma separated sequence "one", "two", "1,2,3" For this we need lookahead construct: " *, *(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

130 PolluterCount.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.commons.lang.math.NumberUtils;

131 PolluterCount.java
public class PolluterCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, DoubleWritable> {
    private DoubleWritable val = new DoubleWritable();
    private Text word = new Text();

132 PolluterCount.java
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      String[] air = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
      if (air.length > 1) {
        String a = air[6].replaceAll("\"", "");
        String b = air[19].replaceAll("\"", "");
        if (NumberUtils.isNumber(b)) {
          double d = Double.parseDouble(b);
          val.set(d);
          word.set(a);
          context.write(word, val);
        }
      }
    }
  }

133 PolluterCount.java
  public static class DoubleSumReducer extends Reducer<Text,DoubleWritable,Text,DoubleWritable> {
    private DoubleWritable result = new DoubleWritable();
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
      double sum = 0;
      for (DoubleWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

134 PolluterCount.java
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "polluter count");
    job.setJarByClass(PolluterCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(DoubleSumReducer.class);
    job.setReducerClass(DoubleSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

135 PolluterCount.java Compile: hadoop com.sun.tools.javac.Main PolluterCount.java
Create jar: jar cf pc.jar PolluterCount*.class
Create directory: hadoop fs -mkdir /user/hduser/pcinput
Copy file to hdfs: hadoop fs -copyFromLocal /home/hduser/air_data.csv /user/hduser/pcinput

136 PolluterCount.java Run: hadoop jar pc.jar PolluterCount /user/hduser/pcinput/air_data.csv /user/hduser/pcoutput
Copy the output to the local fs: hadoop fs -copyToLocal /user/hduser/pcoutput /home/hduser
Sort: cat part-r-00000 | sort -n -k2 -r > sorted.txt

137 PolluterCount.java The unsorted output (part-r-00000) lists the tonnes emitted per province (AB, BC, MB, NB, NL, NS, NT, NU, ON, PE, QC, SK); sorted.txt ranks the provinces by tonnes emitted: AB, ON, QC, BC, MB, SK, NB, NS, NL, PE, NT, NU.

138 Apache log file format (common access) Analysis of large Apache web server log files is one of many hadoop applications. Apache log entry might look like: [01/Jul/1995:00:00: ] "GET /history/apollo/ HTTP/1.0"

139 Apache log file format (common access) The format is: "%h %l %u %t \"%r\" %>s %b" where: 1. %h is the remote host (client IP) 2. %l is identity of the user determined by identd 3. %u is identity of the user determined by HTTP authentication

140 Apache log file format (common access) 4. %t is the time the request was received 5. \"%r\" is the request line from the client ("GET /history/apollo/ HTTP/1.0") 6. %>s is the status code sent from the server to the client (200, 404 etc.) 7. %b is the size of the response to the client (in bytes)

141 Apache log file format (common access) In order to parse these logs we need to define pattern: String apachepattern = "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+)"; See LogParser.java for details.
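A minimal sketch of how such a pattern could be used (the actual LogParser.java referenced above may differ; the log line below is only illustrative, and the group numbers follow the pattern as written):
String apachePattern = "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+)";
Pattern p = Pattern.compile(apachePattern);
String entry = "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] \"GET /history/apollo/ HTTP/1.0\" 200 6245";
Matcher m = p.matcher(entry);
if (m.find()) {
  System.out.println(m.group(1));  // remote host
  System.out.println(m.group(6));  // requested resource
  System.out.println(m.group(8));  // status code
}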

142 Hadoop Ecosystem [diagram]: Apache Oozie (workflow scheduler) sits on top; Hive (SQL), Pig Latin (scripting), Mahout (machine learning) and HBase run over MapReduce and HDFS; Flume (logs) and Sqoop (RDBMS) import/export non-structured and structured data.

143 Hadoop ecosystem MapReduce and HDFS are all we need to process large amounts of data. However, this processing is batch processing - prepare input, process it and analyze output. For real (or near real) time processing one needs HBase.

144 HBase HBase is a non-relational distributed database modelled after Google's Bigtable. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop. Unlike relational databases, HBase does not support SQL scripting - oops!

145 Hive That's where Hive comes in. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Example: SELECT MAX(col_name) AS label FROM table;

146 Pig Latin Don't like writing MapReduce jobs in Java? Try Pig Latin. It is a scripting language that abstracts the programming from the Java MapReduce idiom. Example: words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

147 Mahout How to provide recommendations (e.g. movies you might be interested in on Netflix)? Detect mail spam? Auto-organize new content? Apache Mahout uses Hadoop to produce free implementations of distributed machine learning algorithms.

148 Oozie Apache Oozie is a workflow scheduler system to manage Apache Hadoop jobs. It is integrated with the rest of the Hadoop stack, supporting Java MapReduce, Hive, Pig, Sqoop etc.

149 Flume Apache Flume is a distributed service for efficiently collecting, aggregating, and moving large amounts of log data.

150 Sqoop Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. Imports can also be used to populate tables in HBase.

151 HBase A major deficiency of a file system (HDFS) is that it does not support random read/write operations in real time. HBase is a non-relational (relationships between tables are not supported) database system modelled after Google's Bigtable.

152 HBase It is a distributed storage system on top of HDFS designed to scale to petabytes of data and thousands of machines. Physically, HBase is composed of three types of servers in a master-slave type of architecture: HMaster, RegionServers and ZooKeeper.

153 HBase RegionServers serve data for reads and writes. When accessing data, clients communicate with HBase RegionServers directly. Region assignment and DDL (create, delete tables) operations are handled by the HBase Master process. ZooKeeper, a coordination service used by HBase, maintains the live cluster state.

154 HBase [diagram]: the HMaster registers with ZooKeeper; the client looks up the master via ZooKeeper and reads/writes directly from/to the RegionServers; the HMaster assigns regions to RegionServer 1 and RegionServer 2.

155 HBase Data Model Applications store data in HBase tables. Tables are made of rows and columns. All columns in HBase belong to a particular column family. Table cells are versioned (time-stamped). A cell's content is an uninterpreted array of bytes.

156 HBase Data Model Table row keys are also byte arrays. Rows in HBase tables are sorted by row key. The sort is byte-ordered. All table accesses are via the table row key - its primary key.
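Besides serving MapReduce jobs, rows can be written and read directly by row key with the HBase Java client; a minimal sketch for the 1.x client API is shown below (it reuses the 'test' table and 'cf' column family created in the HBase shell section later, and is not part of the original slides):
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
    Table table = conn.getTable(TableName.valueOf("test"));
    Put put = new Put(Bytes.toBytes("row1"));                    // row key
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("a"),       // column family and qualifier
                  Bytes.toBytes("value1"));
    table.put(put);
    Result result = table.get(new Get(Bytes.toBytes("row1")));   // read back by row key
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("a"));
    System.out.println(Bytes.toString(value));                   // prints: value1
    table.close();
    conn.close();
  }
}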

157 Conceptual View The table below contains two column families named contents and anchor. In this example, anchor contains two columns (anchor:cnnsi.com, anchor:my.look.ca) and contents contains one column (contents:html).
RowKey       Time Stamp  CF1:contents          CF2:anchors
com.cnn.www  t9                                anchor:cnnsi.com="cnn"
com.cnn.www  t8                                anchor:my.look.ca="CNN.com"
com.cnn.www  t6          contents:html=<html>
com.cnn.www  t5          contents:html=<html>
com.cnn.www  t3          contents:html=<html>

158 Physical View Although at a conceptual level tables may be viewed as a sparse set of rows, physically they are stored on a per-column-family basis.
RowKey       Time Stamp  CF2:anchors
com.cnn.www  t9          anchor:cnnsi.com="cnn"
com.cnn.www  t8          anchor:my.look.ca="CNN.com"
RowKey       Time Stamp  CF1:contents
com.cnn.www  t6          contents:html=<html>
com.cnn.www  t5          contents:html=<html>
com.cnn.www  t3          contents:html=<html>

159 HBase Installation Download hbase-1.3.1-bin.tar.gz and unzip: tar -xvzf hbase-1.3.1-bin.tar.gz
Move the directory: sudo mv hbase-1.3.1 /usr/local/
From /usr/local, change ownership as root: chown -R hduser:hadoop hbase-1.3.1

160 HBase Installation As hduser open ~/.bashrc and add:
#HBASE VARIABLES START
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
#HBASE VARIABLES END
Run: source ~/.bashrc

161 HBase Installation Setup hbase-env.sh in /usr/local/hbase-1.3.1/conf Uncomment export JAVA_HOME and set it to: export JAVA_HOME=/opt/jdk1.8.0_151 Add:
export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
export HBASE_MANAGES_ZK=true

162 HBase Installation Setup hbase-site.xml in /usr/local/hbase-1.3.1/conf
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:54310/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>

163 HBase Installation
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2222</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hduser/zookeeper</value>
  </property>
</configuration>

164 HBase Installation Start HBase: start-hbase.sh See if started: jps
6098 HMaster
6498 Jps
6215 HRegionServer
4951 SecondaryNameNode
4456 NameNode
6040 HQuorumPeer
4667 DataNode

165 HBase Shell Start HBase shell: hbase shell hbase(main):001:0> Check status: hbase(main):001:0> status 1 active master, 0 backup masters, 1 servers, 0 dead, average load

166 HBase Shell Create table: hbase(main):003:0> create 'test', 'cf' 0 row(s) in seconds => Hbase::Table - test

167 HBase Shell List table info: hbase(main):004:0> list 'test' TABLE test 1 row(s) in seconds

168 HBase Shell Populate table: hbase(main):005:0> put 'test', 'row1', 'cf:a', 'value1' 0 row(s) in seconds hbase(main):006:0> put 'test', 'row2', 'cf:b', 'value2' 0 row(s) in seconds hbase(main):007:0> put 'test', 'row3', 'cf:c', 'value3' 0 row(s) in seconds

169 HBase Shell Scanning table for all data at once: hbase(main):008:0> scan 'test' ROW COLUMN+CELL row1 column=cf:a, timestamp= , value=value1 row2 column=cf:b, timestamp= , value=value2 row3 column=cf:c, timestamp= , value=value3 3 row(s) in seconds

170 HBase Shell Get a single row of data: hbase(main):009:0> get 'test', 'row1' COLUMN CELL cf:a timestamp= , value=value1 1 row(s) in seconds Disable HBase table: hbase(main):010:0> disable 'test' Enable HBase table: hbase(main):010:0> enable 'test'

171 HBase Shell Drop HBase table: hbase(main):009:0> drop 'test' The table must first be disabled. Exit HBase shell: hbase(main):010:0> quit Stop HBase: stop-hbase.sh

172 HBase as source/sink HBase tables can be used as the source/sink for MapReduce jobs. A source is defined as:
TableMapReduceUtil.initTableMapperJob(
  "sourcetable",    // input table
  scan,             // scan to control cf
  MyMapper.class,   // mapper class
  null,             // mapper output key
  null,             // mapper output value
  job);

173 HBase as source/sink A sink is defined as:
TableMapReduceUtil.initTableReducerJob(
  "targettable",    // output table
  MyReducer.class,  // reducer class
  job);

174 TableMapper Class TableMapper<KEYOUT,VALUEOUT> Where KEYOUT and VALUEOUT are types of the output key and the value respectively. Method map() is inherited from the superclass Mapper.

175 TableReducer Class TableReducer<KEYIN,VALUEIN,KEYOUT> Where KEYIN, VALUEIN and KEYOUT are types of the input/output keys and the input value respectively.

176 TableReducer While the input key and value as well as the output key can be anything handed in from the previous map phase, the output value must be either a Put or a Delete. Method reduce() is inherited from the superclass Reducer.

177 HBase as source/sink example Our next example is going to use HBase table test1 that stores dates and sales and populate table test2 with total sales for each day. Date format is YYYYMMDD#n, where n is the sale counter for particular date.

178 HBase as source/sink example [tables]: test1 (CF1:sales) holds one row per sale with row keys of the form YYYYMMDD#n and values 100, 110, 200 and 210; test2 (CF1:sum) holds one row per date with the daily sums 210 and 410.

179 HBaseCount.java Create tables: hbase(main):003:0> create 'test1', 'cf1' 0 row(s) in seconds => Hbase::Table - test1 hbase(main):004:0> create 'test2', 'cf1' 0 row(s) in seconds => Hbase::Table - test2

180 HBase Shell Populate table test1:
hbase(main):005:0> put 'test1',' #1','cf1:sales','100'
0 row(s) in seconds
hbase(main):006:0> put 'test1',' #2','cf1:sales','110'
0 row(s) in seconds
hbase(main):007:0> put 'test1',' #1','cf1:sales','200'
0 row(s) in seconds
hbase(main):008:0> put 'test1',' #2','cf1:sales','210'
0 row(s) in seconds

181 HBaseCount.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
// HBASE
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;

182 HBaseCount.java
public class HBaseCount {
  public static class HBaseMapper extends TableMapper<Text, IntWritable> {
    private Text text = new Text();
    public void map(ImmutableBytesWritable rowKey, Result columns, Context context) throws IOException, InterruptedException {
      String inKey = new String(rowKey.get());
      String oKey = inKey.split("#")[0];
      text.set(oKey);
      byte[] bSales = columns.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("sales"));
      String sSales = new String(bSales);
      Integer sales = new Integer(sSales);
      context.write(text, new IntWritable(sales));
    }
  }

183 HBaseCount.java
  public static class HBaseReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable sales : values) {
        Integer intSales = new Integer(sales.toString());
        sum += intSales;
      }
      Put insHBase = new Put(key.getBytes());
      insHBase.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
      context.write(null, insHBase);
    }
  }

184 HBaseCount.java
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf1"));
    Job job = Job.getInstance();
    job.setJarByClass(HBaseCount.class);
    TableMapReduceUtil.initTableMapperJob(
      "test1", scan, HBaseMapper.class, Text.class, IntWritable.class, job);
    TableMapReduceUtil.initTableReducerJob(
      "test2", HBaseReducer.class, job);
    job.waitForCompletion(true);
  }
}

185 HBaseCount.java Modify /home/hduser/.bashrc:
HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
#HBASE VARIABLES START
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HBASE_HOME/lib/*"
#HBASE VARIABLES END
export HADOOP_CLASSPATH

186 HBaseCount.java Refresh: source ~/.bashrc
Compile: javac -cp $(hbase classpath):$(hadoop classpath) HBaseCount.java
Create jar: jar cf hb.jar HBaseCount*.class
Run: hadoop jar hb.jar HBaseCount

187 HBaseCount.java Results in the hbase shell:
hbase(main):003:0> scan 'test2'
ROW COLUMN+CELL
 column=cf1:sum, timestamp= , value=\x00\x00\x00\xd2
 column=cf1:sum, timestamp= , value=\x00\x00\x01\x9a
2 row(s) in seconds
hbase(main):001:0> org.apache.hadoop.hbase.util.Bytes.toInt("\x00\x00\x00\xd2".to_java_bytes)
=> 210
hbase(main):004:0> org.apache.hadoop.hbase.util.Bytes.toInt("\x00\x00\x01\x9a".to_java_bytes)
=> 410


Aims. Background. This exercise aims to get you to:

Aims. Background. This exercise aims to get you to: Aims This exercise aims to get you to: Import data into HBase using bulk load Read MapReduce input from HBase and write MapReduce output to HBase Manage data using Hive Manage data using Pig Background

More information

Big Data Analytics: Insights and Innovations

Big Data Analytics: Insights and Innovations International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 6, Issue 10 (April 2013), PP. 60-65 Big Data Analytics: Insights and Innovations

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

Steps: First install hadoop (if not installed yet) by, https://sl6it.wordpress.com/2015/12/04/1-study-and-configure-hadoop-for-big-data/

Steps: First install hadoop (if not installed yet) by, https://sl6it.wordpress.com/2015/12/04/1-study-and-configure-hadoop-for-big-data/ SL-V BE IT EXP 7 Aim: Design and develop a distributed application to find the coolest/hottest year from the available weather data. Use weather data from the Internet and process it using MapReduce. Steps:

More information

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours) Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:

More information

Compile and Run WordCount via Command Line

Compile and Run WordCount via Command Line Aims This exercise aims to get you to: Compile, run, and debug MapReduce tasks via Command Line Compile, run, and debug MapReduce tasks via Eclipse One Tip on Hadoop File System Shell Following are the

More information

Apache Hadoop Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.

Apache Hadoop Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2. SDJ INFOSOFT PVT. LTD Apache Hadoop 2.6.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.x Table of Contents Topic Software Requirements

More information

Installing Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g.

Installing Hadoop. You need a *nix system (Linux, Mac OS X, ) with a working installation of Java 1.7, either OpenJDK or the Oracle JDK. See, e.g. Big Data Computing Instructor: Prof. Irene Finocchi Master's Degree in Computer Science Academic Year 2013-2014, spring semester Installing Hadoop Emanuele Fusco (fusco@di.uniroma1.it) Prerequisites You

More information

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing

ECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters

More information

Introduction to HDFS and MapReduce

Introduction to HDFS and MapReduce Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -

More information

Hortonworks HDPCD. Hortonworks Data Platform Certified Developer. Download Full Version :

Hortonworks HDPCD. Hortonworks Data Platform Certified Developer. Download Full Version : Hortonworks HDPCD Hortonworks Data Platform Certified Developer Download Full Version : https://killexams.com/pass4sure/exam-detail/hdpcd QUESTION: 97 You write MapReduce job to process 100 files in HDFS.

More information

Big Data Exercises. Fall 2017 Week 5 ETH Zurich. MapReduce

Big Data Exercises. Fall 2017 Week 5 ETH Zurich. MapReduce Big Data Exercises Fall 2017 Week 5 ETH Zurich MapReduce Reading: White, T. (2015). Hadoop: The Definitive Guide (4th ed.). O Reilly Media, Inc. [ETH library] (Chapters 2, 6, 7, 8: mandatory, Chapter 9:

More information

Attacking & Protecting Big Data Environments

Attacking & Protecting Big Data Environments Attacking & Protecting Big Data Environments Birk Kauer & Matthias Luft {bkauer, mluft}@ernw.de #WhoAreWe Birk Kauer - Security Researcher @ERNW - Mainly Exploit Developer Matthias Luft - Security Researcher

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. HCatalog

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. HCatalog About the Tutorial HCatalog is a table storage management tool for Hadoop that exposes the tabular data of Hive metastore to other Hadoop applications. It enables users with different data processing tools

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Large-scale Information Processing

Large-scale Information Processing Sommer 2013 Large-scale Information Processing Ulf Brefeld Knowledge Mining & Assessment brefeld@kma.informatik.tu-darmstadt.de Anecdotal evidence... I think there is a world market for about five computers,

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 2: Hadoop Nuts and Bolts Jimmy Lin University of Maryland Thursday, January 31, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

COSC 6397 Big Data Analytics. Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading. Edgar Gabriel Spring 2014.

COSC 6397 Big Data Analytics. Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading. Edgar Gabriel Spring 2014. COSC 6397 Big Data Analytics Data Formats (III) HBase: Java API, HBase in MapReduce and HBase Bulk Loading Edgar Gabriel Spring 2014 Recap on HBase Column-Oriented data store NoSQL DB Data is stored in

More information

Vendor: Hortonworks. Exam Code: HDPCD. Exam Name: Hortonworks Data Platform Certified Developer. Version: Demo

Vendor: Hortonworks. Exam Code: HDPCD. Exam Name: Hortonworks Data Platform Certified Developer. Version: Demo Vendor: Hortonworks Exam Code: HDPCD Exam Name: Hortonworks Data Platform Certified Developer Version: Demo QUESTION 1 Workflows expressed in Oozie can contain: A. Sequences of MapReduce and Pig. These

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Installing Hadoop / Yarn, Hive 2.1.0, Scala , and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes. By: Nicholas Propes 2016

Installing Hadoop / Yarn, Hive 2.1.0, Scala , and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes. By: Nicholas Propes 2016 Installing Hadoop 2.7.3 / Yarn, Hive 2.1.0, Scala 2.11.8, and Spark 2.0 on Raspberry Pi Cluster of 3 Nodes By: Nicholas Propes 2016 1 NOTES Please follow instructions PARTS in order because the results

More information

Hadoop Development Introduction

Hadoop Development Introduction Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand

More information

Hadoop Quickstart. Table of contents

Hadoop Quickstart. Table of contents Table of contents 1 Purpose...2 2 Pre-requisites...2 2.1 Supported Platforms... 2 2.2 Required Software... 2 2.3 Installing Software...2 3 Download...2 4 Prepare to Start the Hadoop Cluster...3 5 Standalone

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Getting Started with Hadoop

Getting Started with Hadoop Getting Started with Hadoop May 28, 2018 Michael Völske, Shahbaz Syed Web Technology & Information Systems Bauhaus-Universität Weimar 1 webis 2018 What is Hadoop Started in 2004 by Yahoo Open-Source implementation

More information

UNIT-IV HDFS. Ms. Selva Mary. G

UNIT-IV HDFS. Ms. Selva Mary. G UNIT-IV HDFS HDFS ARCHITECTURE Dataset partition across a number of separate machines Hadoop Distributed File system The Design of HDFS HDFS is a file system designed for storing very large files with

More information

KillTest *KIJGT 3WCNKV[ $GVVGT 5GTXKEG Q&A NZZV ]]] QORRZKYZ IUS =K ULLKX LXKK [VJGZK YKX\OIK LUX UTK _KGX

KillTest *KIJGT 3WCNKV[ $GVVGT 5GTXKEG Q&A NZZV ]]] QORRZKYZ IUS =K ULLKX LXKK [VJGZK YKX\OIK LUX UTK _KGX KillTest Q&A Exam : CCD-410 Title : Cloudera Certified Developer for Apache Hadoop (CCDH) Version : DEMO 1 / 4 1.When is the earliest point at which the reduce method of a given Reducer can be called?

More information

CSE6331: Cloud Computing

CSE6331: Cloud Computing CSE6331: Cloud Computing Leonidas Fegaras University of Texas at Arlington c 2017 by Leonidas Fegaras Map-Reduce Fundamentals Based on: J. Simeon: Introduction to MapReduce P. Michiardi: Tutorial on MapReduce

More information

Hadoop is essentially an operating system for distributed processing. Its primary subsystems are HDFS and MapReduce (and Yarn).

Hadoop is essentially an operating system for distributed processing. Its primary subsystems are HDFS and MapReduce (and Yarn). 1 Hadoop Primer Hadoop is essentially an operating system for distributed processing. Its primary subsystems are HDFS and MapReduce (and Yarn). 2 Passwordless SSH Before setting up Hadoop, setup passwordless

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

Certified Big Data and Hadoop Course Curriculum

Certified Big Data and Hadoop Course Curriculum Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation

More information

Map Reduce. MCSN - N. Tonellotto - Distributed Enabling Platforms

Map Reduce. MCSN - N. Tonellotto - Distributed Enabling Platforms Map Reduce 1 MapReduce inside Google Googlers' hammer for 80% of our data crunching Large-scale web search indexing Clustering problems for Google News Produce reports for popular queries, e.g. Google

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Multi-Node Cluster Setup on Hadoop. Tushar B. Kute,

Multi-Node Cluster Setup on Hadoop. Tushar B. Kute, Multi-Node Cluster Setup on Hadoop Tushar B. Kute, http://tusharkute.com What is Multi-node? Multi-node cluster Multinode Hadoop cluster as composed of Master- Slave Architecture to accomplishment of BigData

More information

INTRODUCTION TO HADOOP

INTRODUCTION TO HADOOP Hadoop INTRODUCTION TO HADOOP Distributed Systems + Middleware: Hadoop 2 Data We live in a digital world that produces data at an impressive speed As of 2012, 2.7 ZB of data exist (1 ZB = 10 21 Bytes)

More information

Importing and Exporting Data Between Hadoop and MySQL

Importing and Exporting Data Between Hadoop and MySQL Importing and Exporting Data Between Hadoop and MySQL + 1 About me Sarah Sproehnle Former MySQL instructor Joined Cloudera in March 2010 sarah@cloudera.com 2 What is Hadoop? An open-source framework for

More information

CS November 2018

CS November 2018 Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Hadoop Cluster Implementation

Hadoop Cluster Implementation Hadoop Cluster Implementation By Aysha Binta Sayed ID:2013-1-60-068 Supervised By Dr. Md. Shamim Akhter Assistant Professor Department of Computer Science and Engineering East West University A project

More information

Innovatus Technologies

Innovatus Technologies HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String

More information

A Glimpse of the Hadoop Echosystem

A Glimpse of the Hadoop Echosystem A Glimpse of the Hadoop Echosystem 1 Hadoop Echosystem A cluster is shared among several users in an organization Different services HDFS and MapReduce provide the lower layers of the infrastructures Other

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Hadoop & Big Data Analytics Complete Practical & Real-time Training

Hadoop & Big Data Analytics Complete Practical & Real-time Training An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

Introduction into Big Data analytics Lecture 3 Hadoop ecosystem. Janusz Szwabiński

Introduction into Big Data analytics Lecture 3 Hadoop ecosystem. Janusz Szwabiński Introduction into Big Data analytics Lecture 3 Hadoop ecosystem Janusz Szwabiński Outlook of today s talk Apache Hadoop Project Common use cases Getting started with Hadoop Single node cluster Further

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information

Hadoop Online Training

Hadoop Online Training Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

TI2736-B Big Data Processing. Claudia Hauff

TI2736-B Big Data Processing. Claudia Hauff TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Pattern Hadoop Mix Graphs Giraph Spark Zoo Keeper Spark But first Partitioner & Combiner

More information

HDFS Access Options, Applications

HDFS Access Options, Applications Hadoop Distributed File System (HDFS) access, APIs, applications HDFS Access Options, Applications Able to access/use HDFS via command line Know about available application programming interfaces Example

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

itpass4sure Helps you pass the actual test with valid and latest training material.

itpass4sure   Helps you pass the actual test with valid and latest training material. itpass4sure http://www.itpass4sure.com/ Helps you pass the actual test with valid and latest training material. Exam : CCD-410 Title : Cloudera Certified Developer for Apache Hadoop (CCDH) Vendor : Cloudera

More information

HBase: Overview. HBase is a distributed column-oriented data store built on top of HDFS

HBase: Overview. HBase is a distributed column-oriented data store built on top of HDFS HBase 1 HBase: Overview HBase is a distributed column-oriented data store built on top of HDFS HBase is an Apache open source project whose goal is to provide storage for the Hadoop Distributed Computing

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

Recommended Literature

Recommended Literature COSC 6397 Big Data Analytics Introduction to Map Reduce (I) Edgar Gabriel Spring 2017 Recommended Literature Original MapReduce paper by google http://research.google.com/archive/mapreduce-osdi04.pdf Fantastic

More information

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Implementing Algorithmic Skeletons over Hadoop

Implementing Algorithmic Skeletons over Hadoop Implementing Algorithmic Skeletons over Hadoop Dimitrios Mouzopoulos E H U N I V E R S I T Y T O H F R G E D I N B U Master of Science Computer Science School of Informatics University of Edinburgh 2011

More information