
1 Session 1 Big Data and Hadoop - Overview - Dr. M. R. Sanghavi

2 Acknowledgement: Prof. Kainjan M. Sanghavi, for preparing this presentation. This presentation is available on my blog.

3 Big Data & Hadoop

4 Topics

5 What is Data? Distinct pieces of information, usually formatted in a special way. Data can exist in a variety of forms -- as numbers or text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored in a person's mind.

6 Data Management. Data storage may be in a local place (e.g., companies, colleges, hospitals), with its own security and size constraints, or in a central storage place (datacenters). Older practices: pay for space, pay for disk; in case of disaster, fall back to local disks, tape drives, or SAN / NAS.

7 Data Management
1 Kilobyte = 1,000 bytes
1 Megabyte = 1,000,000 bytes
1 Gigabyte = 1,000,000,000 bytes
1 Terabyte = 1,000,000,000,000 bytes
1 Petabyte = 1,000,000,000,000,000 bytes
1 Exabyte = 1,000,000,000,000,000,000 bytes
1 Zettabyte = 1,000,000,000,000,000,000,000 bytes

8 What is Big Data? There is no single standard definition. Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.

9 Big Data: unstructured, structured, and streaming. Sources include websites, social media, billing, ERP, CRM, RFID, and network switches.

10 200+ Customer Stories: Finance & Insurance, Academic & Gov't, Healthcare & Life Sciences, Digital Media & Retail, Manufacturing & High Tech

11 Number of Attendees

12 V3 (Volume, Velocity, Variety) Architecture

13 Existing challenges

14 Requirements: scalability, flexibility, fault tolerance, resource management, security, a single-system view, ease of use

15 HADOOP IS THE SOLUTION


17 Hadoop? Hadoop is an open-source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Hadoop enables a computing solution that is scalable, cost-effective, flexible, and fault-tolerant.

18 Hadoop products developed by vendors: Apache (the origin), Cloudera, Hortonworks, Intel

19 Hadoop supported platforms: Unix and Windows. Linux is the only supported production platform. Other variants of Unix, like Mac OS X, can run Hadoop for development. Windows + Cygwin is a development platform (requires openssh).

20 Hadoop Components: HDFS and MapReduce

21 Hadoop installation modes. Standalone (or local) mode: Hadoop daemons are not run; only the Hadoop libraries are used, similar to an emulator. Pseudo-distributed mode: a single-node cluster, i.e., Hadoop installed on one machine. Fully distributed mode: the NameNode, Secondary NameNode, and JobTracker run on master machines, while the DataNodes and TaskTrackers run on the other machines of the cluster.

22 Standalone / LocalJobRunner Mode. In LocalJobRunner mode, no daemons run; everything runs in a single Java Virtual Machine (JVM). Hadoop uses the machine's standard filesystem for data storage, not HDFS. Suitable for testing MapReduce programs during development.

23 Pseudo-Distributed Mode. In pseudo-distributed mode, all daemons run on the local machine, each in its own JVM (Java Virtual Machine). Hadoop uses HDFS to store data (by default). Useful to simulate a cluster on a single machine, and convenient for debugging programs before launching them on the real cluster.

24 Fully-Distributed Mode. In fully-distributed mode, Hadoop daemons run on a cluster of machines, and HDFS is used to distribute data amongst the nodes. Unless you are running a small cluster (less than 10 or 20 nodes), the NameNode and JobTracker should each be running on dedicated nodes; for small clusters, it's acceptable for both to run on the same physical node.

25 Hadoop Core Components: HDFS [Hadoop Distributed File System] and MapReduce [parallel distributed processing platform]

26 HDFS Daemons: NameNode, Secondary NameNode, DataNodes

27 HDFS Architecture Overview (diagram): Host 1 runs the NameNode (master), Host 2 runs the Secondary NameNode (master), and Hosts 3, 4, 5, ..., n run DataNodes (slaves).

28 HDFS Block Diagram (diagram): the NameNode and Secondary NameNode coordinate DataNodes 1 through N.

29 HDFS Features

30 HDFS Block Replication. Block size = 64 MB (or 128 MB); replication factor = 3. Each HDFS block is stored on three different DataNodes.
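As a worked example (illustrative numbers, not from the slide): a 200 MB file with a 64 MB block size is split into four blocks (64 + 64 + 64 + 8 MB); with a replication factor of 3, HDFS stores 4 x 3 = 12 block replicas spread across the DataNodes.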

31 NameNode (master node). The NameNode stores all metadata: information about file locations in HDFS, information about file ownership and permissions, the names of the individual blocks, and the locations of the blocks. Metadata is stored on disk and read when the NameNode daemon starts up; the filename is fsimage. When changes to the metadata are required, they are made in RAM; the changes are also written to a log file on disk called edits.

32 Secondary NameNode / checkpoint node. fsimage: the latest snapshot of the filesystem metadata to which the NameNode refers. Edit log: the changes made to the filesystem after the NameNode started. On a NameNode restart, the edit log is applied to fsimage to get the latest snapshot of the file system. In production, restarts of the NameNode are rare, so the edit log grows large and we encounter the following issues: the edit log becomes very large, which is challenging to manage; a NameNode restart takes a long time because a lot of changes have to be merged; and in the case of a crash, we lose a huge amount of metadata, since the fsimage is very old. The Secondary NameNode addresses this by periodically fetching the fsimage and edit log from the NameNode, merging them into an updated fsimage, and sending it back (the checkpoint process shown in the slide's diagram).

33 DataNodes / slave nodes. The actual contents of the files are stored as blocks on the slave nodes. Blocks are simply files on the slave node's underlying filesystem, named blk_xxxxxxx. Nothing on the slave node provides information about what underlying file a block is a part of; that information is only stored in the NameNode's metadata. Each block is stored on multiple different nodes for redundancy (the default is three replicas). Each slave node runs a DataNode daemon, which controls access to the blocks and communicates with the NameNode.

34 DataNodes send heartbeats to the NameNode, once every 3 seconds. The NameNode uses heartbeats to detect DataNode failure.

35 MapReduce

36 What is MapReduce? MapReduce is a method for distributing a task across multiple nodes. It consists of two developer-created phases: Map and Reduce. In between Map and Reduce is the shuffle and sort, which sends data from the Mappers to the Reducers.

37 MapReduce, the big picture (diagram): a client submits a JOB to the cluster.

38 How Map and Reduce Work Together. Map returns information; the Reducer accepts that information and applies a user-defined function to reduce the amount of data.

39 Typical problem solved by MapReduce: read a lot of data; Map: extract something you care about from each record; shuffle and sort; Reduce: aggregate, summarize, filter, or transform; write the results. The outline stays the same; Map and Reduce change to fit the problem.

40 Data Flow 1. Mappers read from HDFS 2. Map output is partitioned by key and sent to Reducers 3. Reducers sort input by key 4. Reduce output is written to HDFS

41 MapReduce Job Flow

42 The MapReduce process can be considered similar to a Unix pipeline, e.g., cat input | map | sort | reduce > output.

43 MapReduce Architecture Overview (diagram): Host 1 runs the JobTracker (master), and Hosts 3, 4, 5, ..., n run TaskTrackers (slaves).

44 MapReduce Simple Example: sample input to the Mapper and the intermediate data produced (shown in the slide image; see the illustrative trace after the next slide).

45 MapReduce Simple Example: input to the Reducer and output from the Reducer (shown in the slide image).
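The example content of these two slides is an image; as an illustrative word-count trace (not the slides' exact figures): given the input lines "the cat" and "the dog", the Mappers emit the intermediate pairs <the,1>, <cat,1>, <the,1>, <dog,1>; after shuffle and sort, the Reducer receives <cat,[1]>, <dog,[1]>, <the,[1,1]> and outputs <cat,1>, <dog,1>, <the,2>.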


47 Key and Value

48 Hadoop Ecosystem

49 Recent version of Hadoop

50 Questions?

51 Thank You!!! Reference : Hadoop Training shimpisagar@yahoo.com

52 Session 2 Installation Hadoop Single Mode - Ms B. A. Khivasara and Ms K. R. Nirmal

53 >> cd hadoop/
>> ls   // this gives the list of folders in the hadoop directory
bin
ivy => deployment tools (deploy and install)
c++ => C++ header files
lib => libraries needed by Hadoop to submit a job
libexec => third-party libraries
conf => configuration files
logs => log files
docs => help manual
webapps

54 >> cd conf
>> ls   // this gives the list of files in the /hadoop/conf directory
core-site.xml => used to store NameNode information
hdfs-site.xml => used for distributed file system replication
mapred-site.xml => used to specify the location where the JobTracker must be installed
capacity-scheduler.xml => indicates which job is to be executed first

55 In core-site.xml, after the configuration tag, include: hadoop.tmp.dir, which indicates the location where Hadoop keeps its temporary data (here it is /tmp; we can specify another directory), and fs.default.name, which indicates the location of the NameNode (here it is hdfs://localhost:54310; we can specify another IP address in the case of a multi-node cluster).
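A minimal core-site.xml sketch matching the values described above (the property names are the standard Hadoop 1.x ones; adjust the values to your setup):

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>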

56 hdfs-site.xml: dfs.replication sets the number of replications (here it is 1; we can specify our own value).
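A minimal hdfs-site.xml sketch with the replication setting described above:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>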

57 mapred-site.xml: mapred.job.tracker indicates the location where the JobTracker must be installed (here it is localhost:54311; we can specify our own value).
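A minimal mapred-site.xml sketch with the JobTracker address described above:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>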

58 >> cd ..
>> cd bin
This folder has the following files:
start-all.sh => used to start all the Hadoop daemons
stop-all.sh => used to stop all daemons
hadoop => used to i) execute map/reduce programs and ii) perform file system operations
>> ./hadoop namenode -format
This formats the HDFS NameNode, i.e., initializes the directory where HDFS metadata is stored.

59 This message indicates that the NameNode has been formatted properly.

60 >>start-all.sh

61 >> jps // to verify that the Hadoop daemons are running
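Illustrative jps output for a healthy pseudo-distributed installation (the process IDs are placeholders; the daemon names are the standard Hadoop 1.x set):

2481 NameNode
2605 DataNode
2738 SecondaryNameNode
2812 JobTracker
2937 TaskTracker
3010 Jps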

62 Prerequisites: the JDK must be installed, and the machine should have passwordless ssh. (When two machines communicate with each other in Linux, they do it through ssh, i.e., secure shell.)

63 To make a machine passwordless-ssh capable:
>> which sshd   // if it shows an absolute path, ssh is installed
If it is not installed, then type:
>> sudo apt-get install openssh-server
>> ssh-keygen -t rsa -P ""
>> cd .ssh   (in the home folder)
>> ls   // id_rsa.pub and id_rsa
Then append the public key to the authorized keys (a step the slides imply): cat id_rsa.pub >> authorized_keys

64 Create a directory in the Hadoop file system using
>> ./hadoop fs -mkdir foldername
Browse the NameNode in a browser to check whether the folder is created, using the standard Hadoop 1.x web UI ports: NameNode: http://localhost:50070/, JobTracker: http://localhost:50030/, TaskTracker: http://localhost:50060/. Then i) browse the filesystem, ii) click user, iii) click gurukul.

65 Session 3 Hadoop as Pseudo Distributed Mode (WordCount Program in Hadoop) - Ms K. M. Sanghavi

66 Typical problem solved by MapReduce: (1) Map: process a key/value pair to generate intermediate key/value pairs; (2) Reduce: merge all intermediate values associated with the same key. Users implement an interface of two primary methods:
1. Map: (key1, val1) -> list(key2, val2)
2. Reduce: (key2, [val2]) -> [val3]
Map corresponds to the group-by clause (on the key) of an aggregate function in SQL; Reduce corresponds to the aggregate function (e.g., average) that is computed over all the rows with the same group-by key.


71 Program to run on Hadoop. Download the Eclipse IDE (the latest version of Eclipse is Kepler).
1. Create a new Java project.
2. Add the dependency jars: right-click the project, open Properties, and select Build Path; add all jars from $HADOOP_HOME/lib and from $HADOOP_HOME (where the hadoop core and tools jars live).

72 Program to run on Hadoop.
3. Create the Mapper.
4. Create the Reducer.
5. Create the Driver for the MapReduce job. The MapReduce job is executed via the handy Hadoop utility class ToolRunner.
6. Supply input and output. We need to supply the input file that will be used during the Map phase; the final output will be generated in the output directory by the Reduce task.

73 Program to run on Hadoop.
7. MapReduce job execution: right-click KDriver and select Run As > Java Application.
8. Final output.

74 Program to run on Hadoop

75 Program to run on Hadoop

76 Program to run on Hadoop

77 Mapper
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
Mapper<K1,V1,K2,V2> has the map method; the first pair <K1,V1> is the input key/value pair, the second <K2,V2> is the output key/value pair.
public class KMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {

78 Mapper
// Map method header
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
LongWritable key, Text value: data types of the input key and value to the Mapper.
Mapper<LongWritable, Text, Text, LongWritable>.Context context: collects the key/value data output by the Mapper, i.e., the intermediate outputs. Example: <the, 1>

79 Mapper
// Convert the input line from Text to a String and split it into words
String words[] = value.toString().split(" ");
// Iterate through each word and form key/value pairs
for (String w : words) {
    // Form a key/value pair for each word as <word, one> and push it to the output collector
    context.write(new Text(w), new LongWritable(1));
}
}
}

80 Reducer
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class KReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
Reducer<K2,V2,K2,V3> has the reduce method; the first pair <K2,V2> is the map output key/value pair, the second <K2,V3> is the final output key/value pair.

81 Reducer
// Reduce method header
protected void reduce(Text key, Iterable<LongWritable> value, Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
Text key, Iterable<LongWritable> value: data types of the input key and values to the Reducer; an Iterable, so we can go through the set of values.
Reducer<Text, LongWritable, Text, LongWritable>.Context context: collects the data output by the Reducer.

82 Reducer
// Initialize a variable sum as 0
long sum = 0;
// Iterate through all the values for a key and sum them up
for (LongWritable i : value)
    sum += i.get();
// Push the key and the obtained sum as value to the output collector
context.write(key, new LongWritable(sum));
}
}

83 Driver
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

84 Driver
public class KDriver extends Configured implements Tool {
public int run(String[] arg0) throws Exception {
// Create a Job object and assign a job name for identification purposes
Job job = new Job(getConf(), "KMS");
// The default input format, TextInputFormat, loads data in as (LongWritable, Text) pairs; the long value is the byte offset of the line in the file.
// The basic (default) output format is TextOutputFormat, which writes (key, value) pairs on individual lines of a text file.
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

85 Driver
// Provide the mapper and reducer class names
job.setMapperClass(KMapper.class);
job.setReducerClass(KReducer.class);
// Set the data types of the output key and value for map and reduce
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);

86 Driver
// The HDFS input and output directories to be used
FileInputFormat.addInputPath(job, new Path("input"));
FileOutputFormat.setOutputPath(job, new Path("out"));
// Set the jar file to execute on Hadoop
job.setJarByClass(KDriver.class);
// Display logs and wait for the job to complete
job.waitForCompletion(true);
return 0;
}

87 Driver
// The MapReduce job is executed via the Hadoop utility class ToolRunner
public static void main(String[] args) throws Exception {
    ToolRunner.run(new KDriver(), args);
}
}

88 Creating Input Directory/ File as sample.txt

89 Map Reduce Job Execution

90 Final Output

91 Create a text file with some text on the Ubuntu system using
>> nano filename
Check that the file is created using
>> ls
Copy this file from Ubuntu to Hadoop using
>> ./hadoop fs -copyFromLocal filename foldername
For this command to run we must be in hadoop/bin, as the hadoop command is in the bin folder.

92 Go to the hadoop folder, as it contains the wordcount jar file:
>> cd ..
Execute the wordcount program:
>> bin/hadoop jar hadoop-examples-<version>.jar wordcount foldername outputfoldername
Now browse http://localhost:50070/, navigate to /user/prygma, and see that the output folder is created. Click the output folder, then click the part-r-00000 file and see the output.

93 Now browse http://localhost:50070/, navigate to /user/prygma, and see that the output folder is created.

94 Click the output folder, then click the part-r-00000 file and see the output.
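For a word-count job, the part-r-00000 file contains one word and its count per line, tab-separated; for example (illustrative values, not the slides' actual data):

cat	1
dog	1
the	2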

95 Session 4 Installation Hadoop Fully Distributed Node - Ms B. A. Khivasara and Ms K. R. Nirmal

96 Stop Hadoop if it is running >>stop-all.sh

97 In core-site.xml, inside the configuration tag, replace localhost with the IP address of the cluster node where the NameNode runs.
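A sketch of the resulting property, assuming 192.168.1.100 stands in for the NameNode machine's IP (a placeholder, not a value from the slides):

<property>
  <name>fs.default.name</name>
  <value>hdfs://192.168.1.100:54310</value>
</property>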

98 mapred-site.xml: mapred.job.tracker indicates the location where the JobTracker must be installed; here, it is the JobTracker node's IP address and port (the slide's exact value is not preserved) instead of localhost.

99 hdfs-site.xml: dfs.replication sets the number of replications (here it is 1; we can specify our own value).

100 Passwordless ssh for multi-node.
Step #1: Generate the first ssh key. Generate your first public and private key on a local workstation:
workstation#1 $ ssh-keygen -t rsa
Copy your public key to your remote server using scp:
workstation#1 $ scp ~/.ssh/id_rsa.pub user@remote.server.com:.ssh/authorized_keys

101 Step #2: Generate the next/multiple ssh keys.
i. Log in to the 2nd workstation.
ii. Download the original authorized_keys file from the remote server using scp:
workstation#2 $ scp user@remote.server.com:.ssh/authorized_keys ~/.ssh
iii. Create the new public/private key pair:
workstation#2 $ ssh-keygen -t rsa

102 iv. Append this key to the downloaded authorized_keys file using the cat command:
workstation#2 $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
v. Upload authorized_keys to the remote server again:
workstation#2 $ scp ~/.ssh/authorized_keys user@remote.server.com:.ssh/
Repeat step #2 for each user or workstation that needs access to the remote server.

103 Step #3: Test your setup. Try to log in from workstation #1, #2, and so on to the remote server; you should not be asked for a password:
workstation#1 $ ssh user@remote.server.com
workstation#2 $ ssh user@remote.server.com

104 Running jps on namenode Running jps on datanode

105 Run jar file from any workstation


107 Lastly, do not forget to stop Hadoop using
>> stop-all.sh
