1 D. Praveen Kumar
Junior Research Fellow, Department of Computer Science & Engineering,
Indian Institute of Technology (Indian School of Mines), Dhanbad, Jharkhand, India
Head of IT & ITES, Skill Subsist Impels Ltd, Tirupati
March 25, 2017 Slide: 1 / 60

2 Outline
1. Introduction
2. Big Data
3. Sources of Big Data
4. Tools
5. HDFS
6. Installation
7. Configuration
8. Starting & Stopping
9. Map Reduce
10. Execution

3 Data
Data means a value or a set of values.
Examples: March 1st, 30, 40, ΨΦϕ

4 Information
Meaningful or preprocessed data is called information.
Examples:

5 Data Types
A data type is the kind of data that may appear in a computer.
Examples: int, float, char, double
Abstract data types: user-defined data types.

6 Traditional approaches
Traditional approaches to storing and processing data:
1. File system
2. RDBMS (Relational Database Management Systems)
3. Data Warehouse & Mining Tools
4. Grid Computing
5. Volunteer Computing

7 GUESTS = 4
- Transportation from the railway station to your home: one auto or car is sufficient.
- Food: Mom can prepare food or snacks without trouble.
- Accommodation: your house is sufficient.
- Facilities: the bed, bathrooms, water and TV that you already use are enough.
- You can talk to each other, crack jokes and keep the guests happy.
- Expenditure: nearly Rs. 1,000/-

8 GUESTS = 100
- Transportation = 25 autos/cars or two buses
- Food = catering
- Accommodation = lodge
- Facilities = AC, TV and all other facilities
- Maintenance = somewhat difficult
- Expenditure = nearly Rs. 90,000/-

9 GUESTS = 10000
- Transportation = 2500 autos or 500 buses
- Food = catering
- Accommodation = all lodges, function halls and cottages in the town
- Facilities = AC, TV and all other facilities are somewhat difficult to provide
- Maintenance = more difficult
- Expenditure = nearly Rs. 2,00,000/-

10 Grid Computing

11 Volunteer Computing

12 GUESTS = 10000000
- Transportation = how many autos?
- Food = ?
- Accommodation = ?
- Facilities = ?
- Maintenance = ?
- Cost = ?

13 Problems
The same situation arises in a computing environment:
- It is difficult to handle a huge and ever-growing amount of data.
- The data cannot be processed with only a few machines.
- Distributing large data sets is difficult.
- Constructing online or offline models is very difficult.

14 Solution
A single solution to all these problems is

15 What is Big Data?
Big data refers to voluminous amounts of structured or unstructured data that organizations can potentially mine and analyze.
Big data is a huge collection of large data sets characterized by

16 Data Generation

17 How Data Is Generated

18 Internet of Events
The Internet is the main source generating this vast amount of data.

19 4 Internet of Events

20 4 Questions of Data Analysts
1. What happened?
2. Why did it happen?
3. What will happen?
4. What is the best that can happen?

21 Big Data Platforms and Analytical Software

22 Hadoop
Here we go with

23 Hadoop History
- Hadoop was created by Doug Cutting, the creator of Lucene.
- He was also involved in a project called Nutch, which contained the basic version of Hadoop.
- Nutch combined MapReduce with NDFS (Nutch Distributed File System).
- Later, these two parts of Nutch were moved into a separate project named Hadoop: MapReduce + HDFS (Hadoop Distributed File System).

24 Hadoop
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

25 Hadoop
The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common: contains libraries and utilities needed by other Hadoop modules
- Hadoop Distributed File System (HDFS): a distributed file system that stores data
- Hadoop YARN: a resource-management platform
- Hadoop MapReduce: for large-scale data processing

26 Hadoop Components

27 Hadoop Components

28 HDFS - Goals
The design goals of HDFS:
1. Very large files
2. Streaming data access
3. Commodity hardware

29 HDFS - Limitations
HDFS is not a good fit for:
1. Lots of small files
2. Low-latency data access
3. Multiple writers and arbitrary file modifications

30 HDFS - Concepts
1. Blocks
2. NameNodes
3. DataNodes
4. HDFS Federation
5. HDFS High Availability

31 Requirements
Necessary:
- Java >= 7
- ssh
- Linux OS (Ubuntu >= 14.04)
- Hadoop framework
Optional:
- Eclipse
- Internet connection

32 Java 7 & Installation
Hadoop requires a working Java installation; Java 1.7 or later is recommended.
On Linux, install Java with either of the following commands:
    sudo apt-get install openjdk-7-jdk
or
    sudo apt-get install default-jdk

33 Java PATH Setup
We need to set the Java path.
Open the .bashrc file located in the home directory:
    gedit ~/.bashrc
Add the line below at the end:
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

34 Installation & Configuration of SSH
Hadoop requires SSH (Secure Shell) access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.
Install SSH using the following command:
    sudo apt-get install ssh
Then generate a DSA SSH key for the user (with an empty passphrase) and authorize it:
    ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

35 Download & Extract Hadoop
Download Hadoop from the Apache Download Mirrors.
Extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. In the commands below, <version> stands for the release number of the downloaded package.
    $ cd /usr/local
    $ sudo tar xzf hadoop-<version>.tar.gz
    $ sudo mv hadoop-<version> hadoop

36 Add Hadoop configuration in .bashrc
Add the Hadoop configuration to the .bashrc file in the home directory:
    export HADOOP_INSTALL=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_INSTALL/bin
    export PATH=$PATH:$HADOOP_INSTALL/sbin
    export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
    export HADOOP_HDFS_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_HOME=$HADOOP_INSTALL
    export YARN_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

37 Create the tmp, NameNode & DataNode directories
Create the NameNode directory:
    mkdir -p /usr/local/hadoopdata/hdfs/namenode
Create the DataNode directory:
    mkdir -p /usr/local/hadoopdata/hdfs/datanode
Create the Hadoop tmp directory and hand it over to the Hadoop user (hadoop1 here):
    sudo mkdir -p /app/hadoop/tmp
    sudo chown hadoop1:hadoop1 /app/hadoop/tmp
    sudo chmod 750 /app/hadoop/tmp

38 Files to Configure
The following files need to be configured:
- core-site.xml
- hadoop-env.sh
- mapred-site.xml
- hdfs-site.xml

39 Add properties in /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following snippets between the <configuration> ... </configuration> tags in the core-site.xml file.
Add the property below to specify the location of the tmp directory:
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/app/hadoop/tmp</value>
    </property>
Add the property below to specify the default file system and its port number:
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
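A small way to double-check a site file after editing it is to parse the property entries back out. The sketch below is only an illustration: it uses the JDK's built-in XML parser rather than Hadoop's own Configuration class, the class name SiteConfigReader and the method readProperties are hypothetical, and the XML is passed in as a string instead of being read from /usr/local/hadoop.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Hadoop: lists the name/value pairs of a site file.
final class SiteConfigReader {

    // Parses <property><name>..</name><value>..</value></property> entries
    // from an XML string into a name -> value map (insertion order kept).
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, String> props = new LinkedHashMap<String, String>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = property.getElementsByTagName("name")
                        .item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value")
                        .item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            // Malformed XML or unexpected structure ends up here.
            throw new IllegalArgumentException("Invalid configuration XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>hadoop.tmp.dir</name>"
                + "<value>/app/hadoop/tmp</value></property>"
                + "<property><name>fs.default.name</name>"
                + "<value>hdfs://localhost:9000</value></property>"
                + "</configuration>";
        for (Map.Entry<String, String> e : readProperties(xml).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

A missing closing tag makes the parser fail, which is the kind of editing mistake this check is meant to catch.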

40 Add properties in /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Uncomment the JAVA_HOME line and give the correct path for Java:
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

41 Add property in /usr/local/hadoop/etc/hadoop/mapred-site.xml
In this file we set the host name and port at which the MapReduce job tracker runs.
Add the following in mapred-site.xml:
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:</value>
    </property>

42 Add properties in... etc/hadoop/hdfs-site.xml
In the file hdfs-site.xml add the following.
Set the replication factor:
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
Specify the NameNode directory:
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/usr/local/hadoopdata/hdfs/namenode</value>
    </property>
Specify the DataNode directory:
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/usr/local/hadoopdata/hdfs/datanode</value>
    </property>
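To see what the replication factor means for storage, the following sketch works out how many blocks a file occupies and how many block copies the cluster holds. It is plain arithmetic, not a Hadoop API call. The class name HdfsBlockMath is hypothetical, and the 128 MB block size is an assumption (the Hadoop 2.x default for dfs.blocksize); the slides do not state a block size.

```java
// Hypothetical helper for illustration; it does not talk to a cluster.
final class HdfsBlockMath {

    // Assumed block size: 128 MB, the Hadoop 2.x default for dfs.blocksize.
    static final long BLOCK_SIZE_BYTES = 128L * 1024 * 1024;

    // Number of blocks a file is split into (the last block may be partly filled).
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    // Total number of block copies stored across the cluster.
    static long storedReplicas(long fileSizeBytes, long blockSizeBytes, int replication) {
        return blockCount(fileSizeBytes, blockSizeBytes) * replication;
    }

    public static void main(String[] args) {
        long oneGigabyte = 1024L * 1024 * 1024;
        // 1 GB / 128 MB = 8 blocks
        System.out.println(blockCount(oneGigabyte, BLOCK_SIZE_BYTES));
        // dfs.replication = 1 (single-node setup): 8 block copies
        System.out.println(storedReplicas(oneGigabyte, BLOCK_SIZE_BYTES, 1));
        // dfs.replication = 3: 24 block copies
        System.out.println(storedReplicas(oneGigabyte, BLOCK_SIZE_BYTES, 3));
    }
}
```

With a single DataNode there is nowhere to place a second copy, which is why a replication factor of 1 suits the single-node setup described here.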

43 Formatting the HDFS filesystem via the NameNode
- The first step in starting up your Hadoop installation is formatting the Hadoop file system.
- This is needed only the first time you set up Hadoop.
- Do not format a running Hadoop filesystem, as you will lose all the data currently in HDFS.
- To format the filesystem, run the command:
    hadoop namenode -format

44 Starting single-node cluster
Run the command:
    start-all.sh
This starts a NameNode, SecondaryNameNode, DataNode, ResourceManager and NodeManager on your machine.
A nifty tool for checking whether the expected Hadoop processes are running is jps:
    hadoop1@hadoop1:/usr/local/hadoop$ jps
    2598 NameNode
    3112 ResourceManager
    3523 Jps
    2917 SecondaryNameNode
    2727 DataNode
    3242 NodeManager

45 Stopping your single-node cluster
To stop all the daemons running on your machine, run the command:
    stop-all.sh
The output will look like this:
    stopping NodeManager
    localhost: stopping ResourceManager
    stopping NameNode
    localhost: stopping DataNode
    localhost: stopping SecondaryNameNode

46 Map-Reduce Framework
- MapReduce is a programming paradigm that relies on two functions, Map and Reduce.
- MapReduce is used to manage many large-scale computations.
- The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
- The framework tries to schedule tasks on the nodes where the data is already present.

47 Map-Reduce Computation Steps
1. The key-value pairs from each Map task are collected by a master controller and sorted by key.
2. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
3. The Reduce tasks work on one key at a time and combine all the values associated with that key in some way.
4. The manner of combining the values is determined by the code the user writes for the Reduce function.
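The steps above can be traced in a few lines of plain Java that run on one machine without Hadoop. This is only a sketch of the data flow: the class MapReduceSketch and its method names are hypothetical, the sorted TreeMap stands in for the collect-and-sort step of the master controller, and there is a single reduce loop instead of several Reduce tasks.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process illustration of map -> sort/group -> reduce.
final class MapReduceSketch {

    // Map: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle: group the values by key; the TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> pair : pairs) {
            List<Integer> values = grouped.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: combine all values of one key, here by summing them.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases in order over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, data=2, tools=1}
        System.out.println(wordCount(Arrays.asList("big data big", "data tools")));
    }
}
```

Only the reduce method encodes the "manner of combination"; replacing the sum with a maximum, for example, would change the result without touching the map or shuffle steps.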

48 Hadoop - MapReduce

49 Hadoop - MapReduce (Word Count) Example

50 MapReduce - WordCountMapper
In the WordCountMapper class we perform the following operations:
- Read a line from the file
- Split the line into words
- Assign a count of 1 to each word

51 WordCountMapper source code
    // Emits a (word, 1) pair for every token of each input line.
    public static class WordCountMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

52 MapReduce - WordCountReducer
In the WordCountReducer class we perform the following operations:
- Sum the list of values
- Assign the sum to the corresponding word

53 WordCountReducer source code
    // Sums all counts received for one word and emits (word, total).
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

54 WordCountJob
    // Driver: configures and submits the word-count job.
    public class WordCountJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Job.getInstance replaces the deprecated new Job(conf, name) constructor.
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountJob.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path, must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
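The job above registers the reducer class as its combiner too. That works for word count because partial sums can be added up again without changing the totals. The plain-Java sketch below illustrates the idea without Hadoop; the class CombinerSketch and its methods are hypothetical, and each string stands in for the input split of one Map task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical illustration: pre-summing per split does not change the totals.
final class CombinerSketch {

    // Adds count to the running total stored for word.
    static void add(Map<String, Integer> totals, String word, int count) {
        Integer old = totals.get(word);
        totals.put(word, old == null ? count : old + count);
    }

    // Map plus local combine: count the words of one input split on its own.
    static Map<String, Integer> countSplit(String split) {
        Map<String, Integer> partial = new TreeMap<String, Integer>();
        StringTokenizer tokens = new StringTokenizer(split);
        while (tokens.hasMoreTokens()) {
            add(partial, tokens.nextToken(), 1);
        }
        return partial;
    }

    // Reduce: sum the partial counts that arrive from every split.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<String, Integer>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                add(totals, e.getKey(), e.getValue());
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> partials = new ArrayList<Map<String, Integer>>();
        partials.add(countSplit("big data big"));   // {big=2, data=1}
        partials.add(countSplit("data tools big")); // {big=1, data=1, tools=1}
        System.out.println(reduce(partials));       // {big=3, data=2, tools=1}
    }
}
```

The benefit in a real cluster is that fewer pairs travel between machines: the first split sends two pairs instead of three.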

55 Imports to include
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

56 Execution of Hadoop Program in Eclipse
Step 1:
1. Start Hadoop in a terminal using the command:
    $ start-all.sh
2. Use the jps command to check whether all Hadoop services have started.
Step 2: Open Eclipse.
Step 3: Go to File > New > Project, select Java Project and click the Next button. Enter a project name and click the Finish button.

57 Continue...
Step 4: The project is created and shown on the right side.
1. Right-click on the project > New > Class
2. Enter the name of the class and then click Finish
3. Write the MapReduce program in that class
Step 5: Write the Java program.

58 Continue...
Step 6: Importing JAR files
1. Right-click on the project and select Properties (Alt+Enter)
2. Select Java Build Path, click on Libraries, then click on Add External JARs
3. Select the JARs from the following Hadoop library directories:
    /usr/local/hadoop/share/hadoop/common/libs
    /usr/local/hadoop/share/hadoop/hdfs/libs
    /usr/local/hadoop/share/hadoop/httpfs/libs
    /usr/local/hadoop/share/hadoop/mapreduce/libs
    /usr/local/hadoop/share/hadoop/yarn/libs
    /usr/local/hadoop/share/hadoop/tools/

59 Continue...
Step 7: Set the input file path
1. Create a folder in the home directory
2. Copy the text files into it
3. Select the input path
Step 8: Set the input and output paths
1. Right-click on the source > Run As > Run Configurations > Arguments
2. Enter your input and output paths, separated by a single space
3. Click on Run

60 Thank You
