
ADVANCED DATA ANALYTICS 14CS/IT 703
November, 2017   Seventh Semester   Time: Three Hours
IV/IV B.Tech (Supplementary) DEGREE EXAMINATION   Computer Science Engineering
Maximum: 60 Marks
Answer Question No. 1 compulsorily. (1 X 12 = 12 Marks)
Answer ONE question from each unit. (4 X 12 = 48 Marks)

1. Answer all questions (1 X 12 = 12 Marks)

a) Define Hadoop Ecosystem.
Ans: The Hadoop ecosystem consists of the core components Hadoop Distributed File System (HDFS) and MapReduce, together with supporting tools such as Pig, Hive, Flume, Sqoop, and Oozie.

b) What is the default size of an HDFS block?
Ans: The default HDFS block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.

c) Define the Mapper function.
Ans: The Mapper function processes the input data. The key-value pairs emitted by the map tasks are collected and sorted by key.

d) How do you run Pig in the local execution environment?
Ans: Local mode: pig -x local. MapReduce mode: pig -x mapreduce.

e) List the YARN components.
Ans: Client, Resource Manager, Node Manager, Application Master.

f) Define UDF.
Ans: UDFs are user-defined functions; we can define our own functions and use them. Pig provides UDF support in six programming languages: Java, Jython, Python, JavaScript, Ruby and Groovy.

g) Define the functionality of the Pig compiler.
Ans: The compiler compiles the optimized logical plan into a series of MapReduce jobs.

h) Difference between Pig and Hive.
Ans:
Pig uses a language called Pig Latin; it was originally created at Yahoo. Hive uses a language called HiveQL; it was originally created at Facebook.
Pig Latin is a data flow language; HiveQL is a query processing language.
Pig Latin is a procedural language and fits the pipeline paradigm; HiveQL is a declarative, SQL-like language.
Pig processes structured, semi-structured and unstructured data; Hive processes mostly structured data.

i) What is an RDD?
Ans: Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

j) Write the syntax of the LOAD statement in Hive (a usage example follows this list).
Ans: LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2...)]

k) Difference between Spark and MapReduce.
Ans: In MapReduce, data is distributed over the cluster and processed. Spark also distributes data over the cluster, but performs in-memory processing of the data.

l) Define Sqoop import.
Ans: The Sqoop import tool imports individual tables from an RDBMS into HDFS.
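As a hedged illustration of the LOAD syntax in (j), a minimal HiveQL session might look like the following (the table name, columns and file path are assumptions, not part of the question paper):

CREATE TABLE employee (id INT, name STRING, salary FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/ubuntu/Desktop/employee.txt'
OVERWRITE INTO TABLE employee;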

UNIT-1

2) a) Explain the characteristics of Big Data.

Volume: The "Big" word in Big Data itself defines the volume. At present the existing data is in petabytes (10^15 bytes) and is expected to increase to zettabytes (10^21 bytes) in the near future. Data volume measures the amount of data available to an organization, which does not necessarily have to own all of it as long as it can access it.

Velocity: Velocity in Big Data is a concept which deals with the speed of the data coming from various sources. This characteristic is not limited to the speed of incoming data but also the speed at which the data flows and is aggregated.

Variety: Data variety is a measure of the richness of the data representation: text, images, video, audio, etc. The data being produced is not of a single category, as it includes not only traditional data but also semi-structured data from various resources such as web pages, web log files, social media sites, and documents.

Value: Data value measures the usefulness of data in making decisions. Data science is exploratory and useful in getting to know the data, but analytic science encompasses the predictive power of big data. Users can run certain queries against the stored data and thus deduce important results from the filtered data obtained, and can also rank it according to the dimensions they require. These reports help people to find the business trends according to which they can change their strategies.

Veracity: It refers to the messiness or trustworthiness of the data. Today the quality and accuracy of data are less controllable (hash tags, abbreviations, typos and colloquial speech), but technology now allows us to deal with it, i.e. how to find high-quality data from the vast collections of data that are out there on the Web.

Complexity: Complexity measures the degree of interconnectedness (possibly very large) and interdependence in big data structures, such that a small change (or combination of small changes) in one or a few elements can yield very large changes, or a small change that ripples across or cascades through the system and substantially affects its behavior, or no change at all.

2) b) How do you install and configure Hadoop in distributed mode?

Step 1: Install Java 8.
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

Step 2: After installing Java we need to set the Java path in the .bashrc file.
To open the .bashrc file: gedit ~/.bashrc
Go to the bottom of the file and add the Java path setting:
export JAVA_HOME=<your Java path>   (e.g. /usr/lib/jvm/java-8-oracle)
To check the Java path, type on the terminal: echo $JAVA_HOME

Step 3: Install SSH using the following command:
sudo apt-get install ssh
Then generate a DSA SSH key for the user and authorize it:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Step 4: Add the following lines to the .bashrc file:
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"

Step 5: Run these commands on the terminal:
mkdir -p /usr/local/hadoopdata/hdfs/namenode
mkdir -p /usr/local/hadoopdata/hdfs/datanode
sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop1:hadoop1 /app/hadoop/tmp   (instead of hadoop1:hadoop1 give your own user name, e.g. harish:harish)
sudo chmod 750 /app/hadoop/tmp
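A quick sanity check at this point (not in the original steps, but commonly done before starting the Hadoop daemons) is to confirm that the SSH keys from Step 3 allow passwordless login to the local machine:

ssh localhost    # should log in without asking for a password
exit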

Step 6: Modify these files under /usr/local/hadoop/etc/hadoop:
core-site.xml, hadoop-env.sh, mapred-site.xml, hdfs-site.xml

core-site.xml (copy the code below between <configuration> ... </configuration>):
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>

hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle   (your Java path)

mapred-site.xml (you need to remove ".template" from the file extension):
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>

hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/datanode</value>
</property>

Step 7:
hadoop namenode -format
start-all.sh   (this will start all the daemons)
jps            (to check all the daemons; total: 6)
If you see all 6 daemons, then your Hadoop is ready.

3) Discuss in detail the word count program in MapReduce with Java code.

Word Count Program

WordCountMapper class

package wc.hadoop.training.cse.bec;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // characters treated as word separators
    private String tokens = "[_ $#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"']";

    public void map(LongWritable recAdd, Text rec, Context con)
            throws IOException, InterruptedException {
        String cleanLine = rec.toString().toLowerCase().replaceAll(tokens, " ");
        String[] words = cleanLine.split(" ");
        for (String kw : words) {

            con.write(new Text(kw.trim()), new IntWritable(1));
        }
    }
}

WordCountReducer class

package wc.hadoop.training.cse.bec;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private Map<String, Integer> countMap = new HashMap<String, Integer>();

    public void reduce(Text key, Iterable<IntWritable> values, Context con)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable el : values) {
            sum = sum + el.get();
        }
        countMap.put(key.toString(), new Integer(sum));
    }

    protected void cleanup(Context con) throws IOException, InterruptedException {
        super.cleanup(con);
        // sort the accumulated counts by value before emitting them
        Map<String, Integer> sortedMap = MiscUtils.sortByValues(countMap);
        // int counter = 0;
        for (String key : sortedMap.keySet())

        {
            con.write(new Text(key), new IntWritable(sortedMap.get(key)));
        }
    }
}

WordCountJob class

package wc.hadoop.training.cse.bec;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountJob extends Configured implements Tool {

    public static void main(String[] cla) throws Exception {
        int exitStatus = ToolRunner.run(new WordCountJob(), cla);
        System.exit(exitStatus);
    }

    public int run(String[] args) throws Exception {
        Job jb = Job.getInstance(getConf());
        jb.setJobName("Word Count");
        jb.setMapperClass(WordCountMapper.class);
        jb.setReducerClass(WordCountReducer.class);

        jb.setMapOutputKeyClass(Text.class);
        jb.setMapOutputValueClass(IntWritable.class);
        jb.setOutputKeyClass(Text.class);
        jb.setOutputValueClass(IntWritable.class);
        jb.setJarByClass(WordCountJob.class);
        FileInputFormat.setInputPaths(jb, new Path(args[0]));
        FileOutputFormat.setOutputPath(jb, new Path(args[1]));
        return jb.waitForCompletion(true) ? 0 : 1;
    }
}

MiscUtils class

package wc.hadoop.training.cse.bec;

import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class MiscUtils {

    public static <K extends Comparable, V extends Comparable> Map<K, V> sortByValues(Map<K, V> map) {
        List<Map.Entry<K, V>> entries = new LinkedList<Map.Entry<K, V>>(map.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<K, V>>() {
            public int compare(Map.Entry<K, V> o1, Map.Entry<K, V> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        // LinkedHashMap will keep the keys in the order they are inserted,
        // which here is sorted by descending count
        Map<K, V> sortedMap = new LinkedHashMap<K, V>();

        for (Map.Entry<K, V> entry : entries) {
            sortedMap.put(entry.getKey(), entry.getValue());
        }
        return sortedMap;
    }
}
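To actually run the program, one common workflow is to compile the classes against the Hadoop libraries, package them into a JAR, and submit it with the hadoop command. The JAR name and HDFS paths below are illustrative, not part of the original answer:

javac -cp $(hadoop classpath) -d classes wc/hadoop/training/cse/bec/*.java
jar cf wordcount.jar -C classes .
hadoop fs -put /home/ubuntu/Desktop/wordcount.txt /hadoop/bec/wordcount.txt
hadoop jar wordcount.jar wc.hadoop.training.cse.bec.WordCountJob /hadoop/bec/wordcount.txt /hadoop/bec/wc_out
hadoop fs -cat /hadoop/bec/wc_out/part-r-00000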

UNIT-2

4) a) Explain the anatomy of how a job runs in MapReduce.

The stages are:
1) Job Submission
2) Job Initialization
3) Task Assignment
4) Task Execution
5) Progress and Status Updates
6) Job Completion
7) Failures

1) Job Submission
The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it. Having submitted the job, waitForCompletion() polls the job's progress once per second and reports the progress to the console if it has changed since the last report. When the job completes successfully, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following:
1. Asks the resource manager for a new application ID, used for the MapReduce job ID (step 2).
2. Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
3. Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
4. Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the shared filesystem in a directory named after the job ID (step 3). The job JAR is copied with a high replication factor (controlled by the mapreduce.client.submit.file.replication property, which defaults to 10) so that there are lots of copies across the cluster for the node managers to access when they run tasks for the job.
5. Submits the job by calling submitApplication() on the resource manager (step 4).

2) Job Initialization
The application master must decide how to run the tasks that make up the MapReduce job. If the job is small, the application master may choose to run the tasks in the same JVM as itself.

This happens when it judges that the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. Such a job is said to be uberized, or run as an uber task. What qualifies as a small job? By default, a small job is one that has fewer than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block.
Finally, before any tasks can be run, the application master calls the setupJob() method on the OutputCommitter. For FileOutputCommitter, which is the default, it will create the final output directory for the job and the temporary working space for the task output.

3) Task Assignment
If the job does not qualify for running as an uber task, then the application master requests containers for all the map and reduce tasks in the job from the resource manager (step 8). Requests for map tasks are made first and with a higher priority than those for reduce tasks, since all the map tasks must complete before the sort phase of the reduce can start. Requests for reduce tasks are not made until 5% of map tasks have completed. Reduce tasks can run anywhere in the cluster, but requests for map tasks have data locality constraints that the scheduler tries to honor. Requests also specify memory requirements and CPUs for tasks. By default, each map and reduce task is allocated 1,024 MB of memory and one virtual core. The values are configurable on a per-job basis via the following properties: mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores.
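Because the WordCountJob class above uses ToolRunner, these per-job values can be overridden with generic -D options at submission time; a sketch (the values and paths are illustrative, not from the original answer) is:

hadoop jar wordcount.jar wc.hadoop.training.cse.bec.WordCountJob \
  -D mapreduce.map.memory.mb=2048 \
  -D mapreduce.reduce.memory.mb=4096 \
  -D mapreduce.reduce.cpu.vcores=2 \
  /hadoop/bec/wordcount.txt /hadoop/bec/wc_out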

4) Task Execution
Once a task has been assigned resources for a container on a particular node by the resource manager's scheduler, the application master starts the container by contacting the node manager (steps 9a and 9b). The task is executed by a Java application whose main class is YarnChild. Before it can run the task, it localizes the resources that the task needs, including the job configuration and JAR file, and any files from the distributed cache. Finally, it runs the map or reduce task (step 11).
The YarnChild runs in a dedicated JVM, so that any bugs in the user-defined map and reduce functions (or even in YarnChild) don't affect the node manager by causing it to crash or hang.
Streaming: Streaming runs special map and reduce tasks for the purpose of launching the user-supplied executable and communicating with it.

5) Progress and Status Updates
MapReduce jobs are long-running batch jobs, taking anything from tens of seconds to hours to run. A job and each of its tasks have a status, which includes such things as the state of the job or task, the progress of maps and reduces, the values of the job's counters, and a status message or description. When a task is running, it keeps track of its progress. For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it's a little more complex.

6) Job Completion
When the application master receives a notification that the last task for a job is complete, it changes the status for the job to successful. When the Job polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from the waitForCompletion() method. Job statistics and counters are printed to the console at this point. Finally, on job completion, the application master and the task containers clean up their working state (so intermediate output is deleted), and the OutputCommitter's commitJob() method is called. Job information is archived by the job history server to enable later interrogation by users if desired.

7) Failures
In the real world, user code is buggy, processes crash, and machines fail. One of the major benefits of using Hadoop is its ability to handle such failures and allow your job to complete successfully. We need to consider the failure of any of the following entities: the task, the application master, the node manager, and the resource manager.

4) b) Explain HDFS commands in detail.

1) Print the Hadoop version
Syntax: hadoop version
Ex: hadoop version

2) Create a directory
Syntax: hadoop fs -mkdir [-p] <path>
  -p: create parent directories along the path
Ex: hadoop fs -mkdir /hadoop/bec

3) List the contents in human-readable format
Syntax: hadoop fs -ls [-R] [-h] <path>
Ex: hadoop fs -ls /hadoop

4) Upload a file
Syntax: hadoop fs -put <local src> <dest>
Ex: hadoop fs -put /home/ubuntu/Desktop/hadoop.txt /hadoop/bec/hadop.txt

5) Download a file
Syntax: hadoop fs -get <src> <local dest>
Ex: hadoop fs -get /hadoop/bec/jes.txt /home/lavanya/doc.txt

6) View the content of a file
Syntax: hadoop fs -cat <path (filename)>
Ex: hadoop fs -cat /hadoop/bec/hadop.txt

7) Copy a file from source to destination within HDFS
Syntax: hadoop fs -cp [-f] <hdfs src> <hdfs dest>
  -f: overwrite the file if it already exists
Ex: hadoop fs -cp /hadoop/fds.txt /hadoop/bec/

8) Move a file from source to destination within HDFS
Syntax: hadoop fs -mv <hdfs src> <hdfs dest>
Ex: hadoop fs -mv /hadoop/bec/hadoop.txt /hadoop/

9) Copy from local to HDFS
Syntax: hadoop fs -copyFromLocal [-f] <local src> <dest>
  -f: overwrite the file if it already exists
Ex: hadoop fs -copyFromLocal /home/ubuntu/Desktop/hadoop.txt /hadoop/bec/hadop.txt

10) Copy to local from HDFS
Syntax: hadoop fs -copyToLocal [-f] <hdfs src> <dest>
  -f: overwrite the file if it already exists
Ex: hadoop fs -copyToLocal /hadoop/bec/jes.txt /home/lavanya/doc.txt

11) Remove a file or folder from HDFS
Syntax: hadoop fs -rm [-R] [-skipTrash] <path>
  -skipTrash: delete permanently
To remove a text file: hadoop fs -rm /hadoop/bec/hadop.txt
To remove a folder: hadoop fs -rm -R /hadoop/bec

12) Remove a directory
Syntax: hadoop fs -rmdir [--ignore-fail-on-non-empty] <path>
  --ignore-fail-on-non-empty: do not report a failure even if the directory still contains files
Ex: hadoop fs -rmdir /hadoop/bec/cse

13) Display the last few lines of a file in HDFS
Syntax: hadoop fs -tail [-f] <path>

  -f: output appended data as the file grows
Ex: hadoop fs -tail /hadoop/bec/hadop.txt

14) Display disk usage of files and directories
Syntax: hadoop fs -du [-s] [-h] <path>
  -s: give an aggregate summary of file lengths
  -h: human-readable format
Ex: hadoop fs -du /hadoop/bec/hadop.txt

15) Empty the HDFS trash
Syntax: hadoop fs -expunge
Ex: hadoop fs -expunge
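Putting several of the commands above together, a typical session (directory and file names are illustrative) might look like this:

hadoop fs -mkdir -p /hadoop/bec
hadoop fs -put /home/ubuntu/Desktop/hadoop.txt /hadoop/bec/
hadoop fs -ls -h /hadoop/bec
hadoop fs -cat /hadoop/bec/hadoop.txt
hadoop fs -du -s -h /hadoop/bec
hadoop fs -rm -R -skipTrash /hadoop/bec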

5) a) Discuss how applications run on YARN.

YARN provides its core services via two types of long-running daemon: a resource manager (one per cluster) to manage the use of resources across the cluster, and node managers running on all the nodes in the cluster to launch and monitor containers. A container executes an application-specific process with a constrained set of resources (memory, CPU, and so on).
To run an application on YARN, a client contacts the resource manager and asks it to run an application master process (step 1). The resource manager then finds a node manager that can launch the application master in a container (steps 2a and 2b). Precisely what the application master does once it is running depends on the application. It could simply run a computation in the container it is running in and return the result to the client. Or it could request more containers from the resource manager (step 3), and use them to run a distributed computation (steps 4a and 4b).
YARN itself does not provide any way for the parts of the application (client, master, process) to communicate with one another. Most nontrivial YARN applications use some form of remote communication (such as Hadoop's RPC layer) to pass status updates and results back to the client, but these are specific to the application.
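As a side note (not part of the original answer), the resource manager and node managers described here can be inspected from the command line, for example:

yarn node -list          # node managers registered with the resource manager
yarn application -list   # applications currently known to the resource manager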

YARN has a flexible model for making resource requests. A request for a set of containers can express the amount of computer resources required for each container (memory and CPU), as well as locality constraints for the containers in that request. Locality is critical in ensuring that distributed data processing algorithms use the cluster bandwidth efficiently, so YARN allows an application to specify locality constraints for the containers it is requesting. Locality constraints can be used to request a container on a specific node or rack, or anywhere on the cluster (off-rack).
Sometimes the locality constraint cannot be met, in which case either no allocation is made or, optionally, the constraint can be loosened. For example, if a specific node was requested but it is not possible to start a container on it (because other containers are running on it), then YARN will try to start a container on a node in the same rack, or, if that's not possible, on any node in the cluster. In the common case of launching a container to process an HDFS block (to run a map task in MapReduce, say), the application will request a container on one of the nodes hosting the block's three replicas, or on a node in one of the racks hosting the replicas, or, failing that, on any node in the cluster.
A YARN application can make resource requests at any time while it is running. For example, an application can make all of its requests up front, or it can take a more dynamic approach whereby it requests more resources dynamically to meet the changing needs of the application.

5) b) Differentiate between YARN and MapReduce.

In MapReduce 1, there are two types of daemon that control the job execution process: a jobtracker and one or more tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker. In MapReduce 1, the jobtracker takes care of both job scheduling and task progress monitoring. By contrast, in YARN these responsibilities are handled by separate entities: the resource manager and an application master (one for each MapReduce job). The jobtracker is also responsible for storing job history for completed jobs, although it is possible to run a job history server as a separate daemon to take the load off the jobtracker. In YARN, the equivalent role is the timeline server, which stores application history. The YARN equivalent of a tasktracker is a node manager.
YARN was designed to address many of the limitations in MapReduce 1. The benefits of using YARN include the following:

Scalability
YARN can run on larger clusters than MapReduce 1. MapReduce 1 hits scalability bottlenecks in the region of 4,000 nodes and 40,000 tasks, stemming from the fact that the jobtracker has to manage both jobs and tasks. YARN overcomes these limitations by virtue of its split resource manager/application master architecture: it is designed to scale up to 10,000 nodes and 100,000 tasks. In contrast to the jobtracker, each instance of an application (here, a MapReduce job) has a dedicated application master, which runs for the duration of the application. This model is actually closer to the original Google MapReduce paper, which describes how a master process is started to coordinate map and reduce tasks running on a set of workers.

Availability
High availability (HA) is usually achieved by replicating the state needed for another daemon to take over the work needed to provide the service, in the event of the service daemon failing. However, the large amount of rapidly changing complex state in the jobtracker's memory makes it very difficult to retrofit HA into the jobtracker service. With the jobtracker's responsibilities split between the resource manager and application master in YARN, making the service highly available became a divide-and-conquer problem: provide HA for the resource manager, then for YARN applications. And indeed, Hadoop 2 supports HA both for the resource manager and for the application master for MapReduce jobs.

Utilization
In MapReduce 1, each tasktracker is configured with a static allocation of fixed-size slots, which are divided into map slots and reduce slots at configuration time. A map slot can only be used to run a map task, and a reduce slot can only be used for a reduce task. In YARN, a node manager manages a pool of resources, rather than a fixed number of designated slots. MapReduce running on YARN will not hit the situation where a reduce task has to wait because only map slots are available on the cluster, which can happen in MapReduce 1. If the resources to run the task are available, then the application will be eligible for them. Furthermore, resources in YARN are fine grained, so an application can make a request for what it needs, rather than for an indivisible slot, which may be too big or too small for the particular task.

Multitenancy
In some ways, the biggest benefit of YARN is that it opens up Hadoop to other types of distributed application beyond MapReduce. MapReduce is just one YARN application among many. It is even possible for users to run different versions of MapReduce on the same YARN cluster, which makes the process of upgrading MapReduce more manageable.

UNIT-3

6) a) Explain the architecture of Pig and its components.

Parser:

Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer:
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.

Compiler:
The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine:
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and these MapReduce jobs are executed on Hadoop, producing the desired results.

6) b) Explain Pig Latin scripting for the word count program.

wordcount.pig:
lines = LOAD '/home/ubuntu/Desktop/wordcount.txt' AS (line: chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

Go to the terminal and run the script from the Grunt shell:
grunt> exec /home/ubuntu/Desktop/wordcount.pig
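Two common variations, not shown in the original answer (the paths are illustrative): the script can also be run directly in local mode without a cluster,

pig -x local /home/ubuntu/Desktop/wordcount.pig

and the final DUMP can be replaced by a STORE statement to write the counts to an output directory instead of the console:

STORE wordcount INTO '/home/ubuntu/Desktop/wordcount_out';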

7) Explain Hive working and Hive architecture in detail.

Hive working:

1) Execute Query: The Hive interface, such as the command line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2) Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan, or the requirement of the query.
3) Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4) Send Metadata: The Metastore sends the metadata as a response to the compiler.
5) Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6) Execute Plan:

The driver sends the execute plan to the execution engine.
7) Execute Job: Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the name node, and it assigns this job to the TaskTracker, which is in the data node. Here, the query executes the MapReduce job.
7.1) Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8) Fetch Result: The execution engine receives the results from the data nodes.
9) Send Results: The execution engine sends those resultant values to the driver.
10) Send Results: The driver sends the results to the Hive interfaces.

Hive Architecture:

User Interface: Hive is a data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).

Meta Store:

Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying the schema information in the Metastore. It is one of the replacements of the traditional approach of writing a MapReduce program. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.

Execution Engine: The conjunction part of the HiveQL process engine and MapReduce is the Hive execution engine. The execution engine processes the query and generates the same results as MapReduce. It uses the flavor of MapReduce.

HDFS or HBase: The Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.
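To make the HiveQL process engine concrete, a small hedged example (the table, columns and file path are assumptions, not from the original answer) of the kind of query that gets translated into a MapReduce job is:

CREATE TABLE logs (ip STRING, url STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/hadoop/bec/logs.tsv' INTO TABLE logs;

SELECT url, SUM(hits) AS total_hits
FROM logs
GROUP BY url;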

UNIT-4

8) Explain how a job runs in Spark.

At the highest level, there are two independent entities: the driver, which hosts the application (SparkContext) and schedules tasks for a job; and the executors, which are exclusive to the application, run for the duration of the application, and execute the application's tasks. Usually the driver runs as a client that is not managed by the cluster manager, and the executors run on machines in the cluster.

Job Submission:
A Spark job is submitted automatically when an action (such as count()) is performed on an RDD. Internally, this causes runJob() to be called on the SparkContext (step 1), which passes the call on to the scheduler that runs as a part of the driver (step 2). The scheduler is made up of two parts: a DAG scheduler that breaks down the job into a DAG of stages, and a task scheduler that is responsible for submitting the tasks from each stage to the cluster.

DAG Construction:
To understand how a job is broken up into stages, we need to look at the types of tasks that can run in a stage. There are two types: shuffle map tasks and result tasks. The name of the task type indicates what Spark does with the task's output:

Shuffle map tasks: As the name suggests, shuffle map tasks are like the map-side part of the shuffle in MapReduce. Each shuffle map task runs a computation on one RDD partition and, based on a partitioning function, writes its output to a new set of partitions, which are then fetched in a later stage (which could be composed of either shuffle map tasks or result tasks). Shuffle map tasks run in all stages except the final stage.

Result tasks: Result tasks run in the final stage that returns the result to the user's program (such as the result of a count()). Each result task runs a computation on its RDD partition, then sends the result back to the driver, and the driver assembles the results from each partition into a final result (which may be Unit, in the case of actions like saveAsTextFile()).

The simplest Spark job is one that does not need a shuffle and therefore has just a single stage composed of result tasks. This is like a map-only job in MapReduce.

Task Scheduling:
When the task scheduler is sent a set of tasks, it uses its list of executors that are running for the application and constructs a mapping of tasks to executors that takes placement preferences into account. Next, the task scheduler assigns tasks to executors that have free cores (this may not be the complete set if another job in the same application is running), and it continues to assign more tasks as executors finish running tasks, until the task set is complete. Each task is allocated one core by default, although this can be changed by setting spark.task.cpus.

Task Execution:
An executor runs a task as follows (step 7). First, it makes sure that the JAR and file dependencies for the task are up to date. The executor keeps a local cache of all the dependencies that previous tasks have used, so that it only downloads them when they have changed. Second, it deserializes the task code (which includes the user's functions) from the serialized bytes that were sent as a part of the launch task message. Third, the task code is executed. Note that tasks are run in the same JVM as the executor, so there is no process overhead for task launch. Tasks can return a result to the driver. The result is serialized and sent to the executor backend, and then back to the driver as a status update message. A shuffle map task returns information that allows the next stage to retrieve the output partitions, while a result task returns the value of the result for the partition it ran on, which the driver assembles into a final result to return to the user's program.
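A minimal sketch, using Spark's Java API, of an application whose single action triggers the job flow described above (the class name and input path are assumptions, not part of the original answer):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LineCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Line Count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Transformations are lazy: no job runs yet.
        JavaRDD<String> lines = sc.textFile("/hadoop/bec/wordcount.txt");
        JavaRDD<String> nonEmpty = lines.filter(line -> !line.isEmpty());

        // The count() action calls runJob() on the SparkContext; since there is
        // no shuffle, the job has a single stage made up of result tasks.
        long n = nonEmpty.count();
        System.out.println("Non-empty lines: " + n);

        sc.stop();
    }
}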

9) a) Explain in detail the Sqoop import and export statements.
9) b) Explain the functionality of the import and export functions in Sqoop with neat diagrams.

The import tool imports individual tables from an RDBMS into HDFS. Each row in a table is treated as a record in HDFS. All records are stored as text data in text files, or as binary data in Avro and Sequence files.

Syntax
The following syntax is used to import data into HDFS:
$ sqoop import (generic-args) (import-args)

Importing a Table
The Sqoop import tool is used to import table data from a table into the Hadoop file system as a text file or a binary file.

Import an entire table:
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table cities

Import a subset of data:

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table cities \
  --where "country = 'USA'"

IMPORT-ALL-TABLES
The following syntax is used to import all tables:
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)

The following command is used to import all the tables from the userdb database:
$ sqoop import-all-tables \
  --connect jdbc:mysql://localhost/userdb \
  --username root

The following command is used to verify that all the table data has been imported into HDFS:
$ $HADOOP_HOME/bin/hadoop fs -ls

Sqoop Export
The default operation is to insert all the records from the input files into the database table using the INSERT statement.

In update mode, Sqoop generates UPDATE statements that replace the existing records in the database.

The following is the syntax for the export command:
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)

The following command is used to export the table data (which is in the emp_data file on HDFS) to the employee table in the db database of the MySQL database server:
$ sqoop export \
  --connect jdbc:mysql://localhost/db \
  --username root \
  --table employee \
  --export-dir /emp/emp_data
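For the update mode mentioned above, a hedged sketch (assuming the employee table has an id primary-key column, which is not stated in the original answer) would add the --update-key option:

$ sqoop export \
  --connect jdbc:mysql://localhost/db \
  --username root \
  --table employee \
  --export-dir /emp/emp_data \
  --update-key id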


More information

Big Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka

Big Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka Course Curriculum: Your 10 Module Learning Plan Big Data and Hadoop About Edureka Edureka is a leading e-learning platform providing live instructor-led interactive online training. We cater to professionals

More information

3 Hadoop Installation: Pseudo-distributed mode

3 Hadoop Installation: Pseudo-distributed mode Laboratory 3 Hadoop Installation: Pseudo-distributed mode Obecjective Hadoop can be run in 3 different modes. Different modes of Hadoop are 1. Standalone Mode Default mode of Hadoop HDFS is not utilized

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014 Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example

More information

Actual4Dumps. Provide you with the latest actual exam dumps, and help you succeed

Actual4Dumps.   Provide you with the latest actual exam dumps, and help you succeed Actual4Dumps http://www.actual4dumps.com Provide you with the latest actual exam dumps, and help you succeed Exam : HDPCD Title : Hortonworks Data Platform Certified Developer Vendor : Hortonworks Version

More information

An Introduction to Big Data Analysis using Spark

An Introduction to Big Data Analysis using Spark An Introduction to Big Data Analysis using Spark Mohamad Jaber American University of Beirut - Faculty of Arts & Sciences - Department of Computer Science May 17, 2017 Mohamad Jaber (AUB) Spark May 17,

More information

UNIT-IV HDFS. Ms. Selva Mary. G

UNIT-IV HDFS. Ms. Selva Mary. G UNIT-IV HDFS HDFS ARCHITECTURE Dataset partition across a number of separate machines Hadoop Distributed File system The Design of HDFS HDFS is a file system designed for storing very large files with

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

Vendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo

Vendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo Vendor: Cloudera Exam Code: CCD-410 Exam Name: Cloudera Certified Developer for Apache Hadoop Version: Demo QUESTION 1 When is the earliest point at which the reduce method of a given Reducer can be called?

More information

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Yojna Arora, Dinesh Goyal Abstract: Big Data refers to that huge amount of data which cannot be analyzed by using traditional analytics

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Techno Expert Solutions An institute for specialized studies!

Techno Expert Solutions An institute for specialized studies! Course Content of Big Data Hadoop( Intermediate+ Advance) Pre-requistes: knowledge of Core Java/ Oracle: Basic of Unix S.no Topics Date Status Introduction to Big Data & Hadoop Importance of Data& Data

More information

Data Analytics Job Guarantee Program

Data Analytics Job Guarantee Program Data Analytics Job Guarantee Program 1. INSTALLATION OF VMWARE 2. MYSQL DATABASE 3. CORE JAVA 1.1 Types of Variable 1.2 Types of Datatype 1.3 Types of Modifiers 1.4 Types of constructors 1.5 Introduction

More information

Installation of Hadoop on Ubuntu

Installation of Hadoop on Ubuntu Installation of Hadoop on Ubuntu Various software and settings are required for Hadoop. This section is mainly developed based on rsqrl.com tutorial. 1- Install Java Software Java Version* Openjdk version

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce

More information

Microsoft Big Data and Hadoop

Microsoft Big Data and Hadoop Microsoft Big Data and Hadoop Lara Rubbelke @sqlgal Cindy Gross @sqlcindy 2 The world of data is changing The 4Vs of Big Data http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data 3 Common

More information

Guidelines For Hadoop and Spark Cluster Usage

Guidelines For Hadoop and Spark Cluster Usage Guidelines For Hadoop and Spark Cluster Usage Procedure to create an account in CSX. If you are taking a CS prefix course, you already have an account; to get an initial password created: 1. Login to https://cs.okstate.edu/pwreset

More information

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2 Big Data Analysis using Hadoop Map-Reduce An Introduction Lecture 2 Last Week - Recap 1 In this class Examine the Map-Reduce Framework What work each of the MR stages does Mapper Shuffle and Sort Reducer

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File

More information

Hadoop Quickstart. Table of contents

Hadoop Quickstart. Table of contents Table of contents 1 Purpose...2 2 Pre-requisites...2 2.1 Supported Platforms... 2 2.2 Required Software... 2 2.3 Installing Software...2 3 Download...2 4 Prepare to Start the Hadoop Cluster...3 5 Standalone

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

If you have ever appeared for the Hadoop interview, you must have experienced many Hadoop scenario based interview questions.

If you have ever appeared for the Hadoop interview, you must have experienced many Hadoop scenario based interview questions. Scenario Based Hadoop Interview Questions & Answers [Mega List] If you have ever appeared for the Hadoop interview, you must have experienced many Hadoop scenario based interview questions. Here I have

More information

UNIT II HADOOP FRAMEWORK

UNIT II HADOOP FRAMEWORK UNIT II HADOOP FRAMEWORK Hadoop Hadoop is an Apache open source framework written in java that allows distributed processing of large datasets across clusters of computers using simple programming models.

More information

Chase Wu New Jersey Institute of Technology

Chase Wu New Jersey Institute of Technology CS 644: Introduction to Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Institute of Technology Some of the slides were provided through the courtesy of Dr. Ching-Yung Lin at Columbia

More information

Innovatus Technologies

Innovatus Technologies HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String

More information