
ADVANCED DATA ANALYTICS 14CS/IT 703
November, 2017   Seventh Semester   Time: Three Hours
IV/IV B.Tech (Supplementary) DEGREE EXAMINATION   Computer Science Engineering
Maximum: 60 Marks
Answer Question No. 1 compulsorily. (1 X 12 = 12 Marks)
Answer ONE question from each unit. (4 X 12 = 48 Marks)

1. Answer all questions (1 X 12 = 12 Marks)

a) Define Hadoop Ecosystem.
Ans: The Hadoop ecosystem consists of the core components Hadoop Distributed File System (HDFS) and MapReduce, together with supporting tools such as Pig, Hive, Flume, Sqoop, and Oozie.

b) What is the default size of an HDFS block?
Ans: The default HDFS block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.

c) Define the Mapper function.
Ans: The Mapper function processes the input data. The key-value pairs emitted by the map tasks are collected and sorted by key.

d) How do you run Pig in the local execution environment?
Ans: Local mode: pig -x local. MapReduce mode: pig -x mapreduce.

e) List the YARN components.
Ans: Client, Resource Manager, Node Manager, Application Master.

f) Define UDF.
Ans: UDFs are user-defined functions; we can define our own functions and use them. Pig provides UDF support in six programming languages: Java, Jython, Python, JavaScript, Ruby and Groovy.

g) Define the functionality of the Pig compiler.
Ans: The compiler compiles the optimized logical plan into a series of MapReduce jobs.

h) Difference between Pig and Hive.
Ans:
Pig uses a language called Pig Latin; it was originally created at Yahoo. Hive uses a language called HiveQL; it was originally created at Facebook.
Pig Latin is a data flow language; HiveQL is a query processing language.
Pig Latin is a procedural language and fits the pipeline paradigm; HiveQL is a declarative, SQL-like language.
Pig processes structured, semi-structured and unstructured data; Hive processes mostly structured data.

i) What is an RDD?
Ans: Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

j) Write the syntax of the LOAD statement in Hive (a usage example follows this list).
Ans: LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2...)]

k) Difference between Spark and MapReduce.
Ans: In MapReduce, data is distributed over the cluster and processed. Spark also distributes data over the cluster, but performs in-memory processing of the data.

l) Define Sqoop import.
Ans: The Sqoop import tool imports individual tables from an RDBMS into HDFS.
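As a hedged illustration of the LOAD syntax in (j), a minimal HiveQL session might look like the following (the table name, columns and file path are assumptions, not part of the question paper):

CREATE TABLE employee (id INT, name STRING, salary FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/ubuntu/Desktop/employee.txt'
OVERWRITE INTO TABLE employee;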

UNIT-1

2) a) Explain the characteristics of Big Data.

Volume: The "Big" word in Big Data itself defines the volume. At present the existing data is in petabytes (10^15 bytes) and is expected to increase to zettabytes (10^21 bytes) in the near future. Data volume measures the amount of data available to an organization, which does not necessarily have to own all of it as long as it can access it.

Velocity: Velocity in Big Data is a concept which deals with the speed of the data coming from various sources. This characteristic is not limited to the speed of incoming data but also the speed at which the data flows and is aggregated.

Variety: Data variety is a measure of the richness of the data representation: text, images, video, audio, etc. The data being produced is not of a single category, as it includes not only traditional data but also semi-structured data from various resources such as web pages, web log files, social media sites, and documents.

Value: Data value measures the usefulness of data in making decisions. Data science is exploratory and useful in getting to know the data, but analytic science encompasses the predictive power of big data. Users can run certain queries against the stored data and thus deduce important results from the filtered data obtained, and can also rank it according to the dimensions they require. These reports help people to find the business trends according to which they can change their strategies.

Veracity: It refers to the messiness or trustworthiness of the data. Today the quality and accuracy of data are less controllable (hash tags, abbreviations, typos and colloquial speech), but technology now allows us to deal with it, i.e. how to find high-quality data from the vast collections of data that are out there on the Web.

Complexity: Complexity measures the degree of interconnectedness (possibly very large) and interdependence in big data structures, such that a small change (or combination of small changes) in one or a few elements can yield very large changes, or a small change that ripples across or cascades through the system and substantially affects its behavior, or no change at all.

2) b) How do you install and configure Hadoop in distributed mode?

Step 1: Install Java 8.
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

Step 2: After installing Java we need to set the Java path in the .bashrc file.
To open the .bashrc file: gedit ~/.bashrc
Go to the bottom of the file and add the Java path setting:
export JAVA_HOME=<your Java path>   (e.g. /usr/lib/jvm/java-8-oracle)
To check the Java path, type on the terminal: echo $JAVA_HOME

Step 3: Install SSH using the following command:
sudo apt-get install ssh
Then generate a DSA SSH key for the user and authorize it:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Step 4: Add the following lines to the .bashrc file:
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"

Step 5: Run these commands on the terminal:
mkdir -p /usr/local/hadoopdata/hdfs/namenode
mkdir -p /usr/local/hadoopdata/hdfs/datanode
sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop1:hadoop1 /app/hadoop/tmp   (instead of hadoop1:hadoop1 give your own user name, e.g. harish:harish)
sudo chmod 750 /app/hadoop/tmp
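A quick sanity check at this point (not in the original steps, but commonly done before starting the Hadoop daemons) is to confirm that the SSH keys from Step 3 allow passwordless login to the local machine:

ssh localhost    # should log in without asking for a password
exit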

Step 6: Modify these files under /usr/local/hadoop/etc/hadoop:
core-site.xml, hadoop-env.sh, mapred-site.xml, hdfs-site.xml

core-site.xml (copy the code below between <configuration> ... </configuration>):
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>

hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle   (your Java path)

mapred-site.xml (you need to remove ".template" from the file extension):
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>

hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/datanode</value>
</property>

Step 7:
hadoop namenode -format
start-all.sh   (this will start all the daemons)
jps            (to check all the daemons; total: 6)
If you see all 6 daemons, then your Hadoop is ready.

3) Discuss in detail the word count program in MapReduce with Java code.

Word Count Program

WordCountMapper class

package wc.hadoop.training.cse.bec;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // characters treated as word separators
    private String tokens = "[_ $#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"']";

    public void map(LongWritable recAdd, Text rec, Context con)
            throws IOException, InterruptedException {
        String cleanLine = rec.toString().toLowerCase().replaceAll(tokens, " ");
        String[] words = cleanLine.split(" ");
        for (String kw : words) {

            con.write(new Text(kw.trim()), new IntWritable(1));
        }
    }
}

WordCountReducer class

package wc.hadoop.training.cse.bec;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private Map<String, Integer> countMap = new HashMap<String, Integer>();

    public void reduce(Text key, Iterable<IntWritable> values, Context con)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable el : values) {
            sum = sum + el.get();
        }
        countMap.put(key.toString(), new Integer(sum));
    }

    protected void cleanup(Context con) throws IOException, InterruptedException {
        super.cleanup(con);
        // sort the accumulated counts by value before emitting them
        Map<String, Integer> sortedMap = MiscUtils.sortByValues(countMap);
        // int counter = 0;
        for (String key : sortedMap.keySet())

        {
            con.write(new Text(key), new IntWritable(sortedMap.get(key)));
        }
    }
}

WordCountJob class

package wc.hadoop.training.cse.bec;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountJob extends Configured implements Tool {

    public static void main(String[] cla) throws Exception {
        int exitStatus = ToolRunner.run(new WordCountJob(), cla);
        System.exit(exitStatus);
    }

    public int run(String[] args) throws Exception {
        Job jb = Job.getInstance(getConf());
        jb.setJobName("Word Count");
        jb.setMapperClass(WordCountMapper.class);
        jb.setReducerClass(WordCountReducer.class);

        jb.setMapOutputKeyClass(Text.class);
        jb.setMapOutputValueClass(IntWritable.class);
        jb.setOutputKeyClass(Text.class);
        jb.setOutputValueClass(IntWritable.class);
        jb.setJarByClass(WordCountJob.class);
        FileInputFormat.setInputPaths(jb, new Path(args[0]));
        FileOutputFormat.setOutputPath(jb, new Path(args[1]));
        return jb.waitForCompletion(true) ? 0 : 1;
    }
}

MiscUtils class

package wc.hadoop.training.cse.bec;

import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class MiscUtils {

    public static <K extends Comparable, V extends Comparable> Map<K, V> sortByValues(Map<K, V> map) {
        List<Map.Entry<K, V>> entries = new LinkedList<Map.Entry<K, V>>(map.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<K, V>>() {
            public int compare(Map.Entry<K, V> o1, Map.Entry<K, V> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        // LinkedHashMap will keep the keys in the order they are inserted,
        // which here is sorted by descending count
        Map<K, V> sortedMap = new LinkedHashMap<K, V>();

        for (Map.Entry<K, V> entry : entries) {
            sortedMap.put(entry.getKey(), entry.getValue());
        }
        return sortedMap;
    }
}
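To actually run the program, one common workflow is to compile the classes against the Hadoop libraries, package them into a JAR, and submit it with the hadoop command. The JAR name and HDFS paths below are illustrative, not part of the original answer:

javac -cp $(hadoop classpath) -d classes wc/hadoop/training/cse/bec/*.java
jar cf wordcount.jar -C classes .
hadoop fs -put /home/ubuntu/Desktop/wordcount.txt /hadoop/bec/wordcount.txt
hadoop jar wordcount.jar wc.hadoop.training.cse.bec.WordCountJob /hadoop/bec/wordcount.txt /hadoop/bec/wc_out
hadoop fs -cat /hadoop/bec/wc_out/part-r-00000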

UNIT-2

4) a) Explain the anatomy of how a job runs in MapReduce.

The stages are:
1) Job Submission
2) Job Initialization
3) Task Assignment
4) Task Execution
5) Progress and Status Updates
6) Job Completion
7) Failures

1) Job Submission
The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it. Having submitted the job, waitForCompletion() polls the job's progress once per second and reports the progress to the console if it has changed since the last report. When the job completes successfully, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following:
1. Asks the resource manager for a new application ID, used for the MapReduce job ID (step 2).
2. Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
3. Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
4. Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the shared filesystem in a directory named after the job ID (step 3). The job JAR is copied with a high replication factor (controlled by the mapreduce.client.submit.file.replication property, which defaults to 10) so that there are lots of copies across the cluster for the node managers to access when they run tasks for the job.
5. Submits the job by calling submitApplication() on the resource manager (step 4).

2) Job Initialization
The application master must decide how to run the tasks that make up the MapReduce job. If the job is small, the application master may choose to run the tasks in the same JVM as itself.

This happens when it judges that the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. Such a job is said to be uberized, or run as an uber task. What qualifies as a small job? By default, a small job is one that has fewer than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block.
Finally, before any tasks can be run, the application master calls the setupJob() method on the OutputCommitter. For FileOutputCommitter, which is the default, it will create the final output directory for the job and the temporary working space for the task output.

3) Task Assignment
If the job does not qualify for running as an uber task, then the application master requests containers for all the map and reduce tasks in the job from the resource manager (step 8). Requests for map tasks are made first and with a higher priority than those for reduce tasks, since all the map tasks must complete before the sort phase of the reduce can start. Requests for reduce tasks are not made until 5% of map tasks have completed. Reduce tasks can run anywhere in the cluster, but requests for map tasks have data locality constraints that the scheduler tries to honor. Requests also specify memory requirements and CPUs for tasks. By default, each map and reduce task is allocated 1,024 MB of memory and one virtual core. The values are configurable on a per-job basis via the following properties: mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores.
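Because the WordCountJob class above uses ToolRunner, these per-job values can be overridden with generic -D options at submission time; a sketch (the values and paths are illustrative, not from the original answer) is:

hadoop jar wordcount.jar wc.hadoop.training.cse.bec.WordCountJob \
  -D mapreduce.map.memory.mb=2048 \
  -D mapreduce.reduce.memory.mb=4096 \
  -D mapreduce.reduce.cpu.vcores=2 \
  /hadoop/bec/wordcount.txt /hadoop/bec/wc_out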

4) Task Execution
Once a task has been assigned resources for a container on a particular node by the resource manager's scheduler, the application master starts the container by contacting the node manager (steps 9a and 9b). The task is executed by a Java application whose main class is YarnChild. Before it can run the task, it localizes the resources that the task needs, including the job configuration and JAR file, and any files from the distributed cache. Finally, it runs the map or reduce task (step 11).
The YarnChild runs in a dedicated JVM, so that any bugs in the user-defined map and reduce functions (or even in YarnChild) don't affect the node manager by causing it to crash or hang.
Streaming: Streaming runs special map and reduce tasks for the purpose of launching the user-supplied executable and communicating with it.

5) Progress and Status Updates
MapReduce jobs are long-running batch jobs, taking anything from tens of seconds to hours to run. A job and each of its tasks have a status, which includes such things as the state of the job or task, the progress of maps and reduces, the values of the job's counters, and a status message or description. When a task is running, it keeps track of its progress. For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it's a little more complex.

6) Job Completion
When the application master receives a notification that the last task for a job is complete, it changes the status for the job to successful. When the Job polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from the waitForCompletion() method. Job statistics and counters are printed to the console at this point. Finally, on job completion, the application master and the task containers clean up their working state (so intermediate output is deleted), and the OutputCommitter's commitJob() method is called. Job information is archived by the job history server to enable later interrogation by users if desired.

7) Failures
In the real world, user code is buggy, processes crash, and machines fail. One of the major benefits of using Hadoop is its ability to handle such failures and allow your job to complete successfully. We need to consider the failure of any of the following entities: the task, the application master, the node manager, and the resource manager.

4) b) Explain HDFS commands in detail.

1) Print the Hadoop version
Syntax: hadoop version
Ex: hadoop version

2) Create a directory
Syntax: hadoop fs -mkdir [-p] <path>
  -p: create parent directories along the path
Ex: hadoop fs -mkdir /hadoop/bec

3) List the contents in human-readable format
Syntax: hadoop fs -ls [-R] [-h] <path>
Ex: hadoop fs -ls /hadoop

4) Upload a file
Syntax: hadoop fs -put <local src> <dest>
Ex: hadoop fs -put /home/ubuntu/Desktop/hadoop.txt /hadoop/bec/hadop.txt

5) Download a file
Syntax: hadoop fs -get <src> <local dest>
Ex: hadoop fs -get /hadoop/bec/jes.txt /home/lavanya/doc.txt

6) View the content of a file
Syntax: hadoop fs -cat <path (filename)>
Ex: hadoop fs -cat /hadoop/bec/hadop.txt

7) Copy a file from source to destination within HDFS
Syntax: hadoop fs -cp [-f] <hdfs src> <hdfs dest>
  -f: overwrite the file if it already exists
Ex: hadoop fs -cp /hadoop/fds.txt /hadoop/bec/

8) Move a file from source to destination within HDFS
Syntax: hadoop fs -mv <hdfs src> <hdfs dest>
Ex: hadoop fs -mv /hadoop/bec/hadoop.txt /hadoop/

9) Copy from local to HDFS
Syntax: hadoop fs -copyFromLocal [-f] <local src> <dest>
  -f: overwrite the file if it already exists
Ex: hadoop fs -copyFromLocal /home/ubuntu/Desktop/hadoop.txt /hadoop/bec/hadop.txt

10) Copy to local from HDFS
Syntax: hadoop fs -copyToLocal [-f] <hdfs src> <dest>
  -f: overwrite the file if it already exists
Ex: hadoop fs -copyToLocal /hadoop/bec/jes.txt /home/lavanya/doc.txt

11) Remove a file or folder from HDFS
Syntax: hadoop fs -rm [-R] [-skipTrash] <path>
  -skipTrash: delete permanently
To remove a text file: hadoop fs -rm /hadoop/bec/hadop.txt
To remove a folder: hadoop fs -rm -R /hadoop/bec

12) Remove a directory
Syntax: hadoop fs -rmdir [--ignore-fail-on-non-empty] <path>
  --ignore-fail-on-non-empty: do not report a failure even if the directory still contains files
Ex: hadoop fs -rmdir /hadoop/bec/cse

13) Display the last few lines of a file in HDFS
Syntax: hadoop fs -tail [-f] <path>

  -f: output appended data as the file grows
Ex: hadoop fs -tail /hadoop/bec/hadop.txt

14) Display disk usage of files and directories
Syntax: hadoop fs -du [-s] [-h] <path>
  -s: give an aggregate summary of file lengths
  -h: human-readable format
Ex: hadoop fs -du /hadoop/bec/hadop.txt

15) Empty the HDFS trash
Syntax: hadoop fs -expunge
Ex: hadoop fs -expunge
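Putting several of the commands above together, a typical session (directory and file names are illustrative) might look like this:

hadoop fs -mkdir -p /hadoop/bec
hadoop fs -put /home/ubuntu/Desktop/hadoop.txt /hadoop/bec/
hadoop fs -ls -h /hadoop/bec
hadoop fs -cat /hadoop/bec/hadoop.txt
hadoop fs -du -s -h /hadoop/bec
hadoop fs -rm -R -skipTrash /hadoop/bec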

5) a) Discuss how applications run on YARN.

YARN provides its core services via two types of long-running daemon: a resource manager (one per cluster) to manage the use of resources across the cluster, and node managers running on all the nodes in the cluster to launch and monitor containers. A container executes an application-specific process with a constrained set of resources (memory, CPU, and so on).
To run an application on YARN, a client contacts the resource manager and asks it to run an application master process (step 1). The resource manager then finds a node manager that can launch the application master in a container (steps 2a and 2b). Precisely what the application master does once it is running depends on the application. It could simply run a computation in the container it is running in and return the result to the client. Or it could request more containers from the resource manager (step 3), and use them to run a distributed computation (steps 4a and 4b).
YARN itself does not provide any way for the parts of the application (client, master, process) to communicate with one another. Most nontrivial YARN applications use some form of remote communication (such as Hadoop's RPC layer) to pass status updates and results back to the client, but these are specific to the application.
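As a side note (not part of the original answer), the resource manager and node managers described here can be inspected from the command line, for example:

yarn node -list          # node managers registered with the resource manager
yarn application -list   # applications currently known to the resource manager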

YARN has a flexible model for making resource requests. A request for a set of containers can express the amount of computer resources required for each container (memory and CPU), as well as locality constraints for the containers in that request. Locality is critical in ensuring that distributed data processing algorithms use the cluster bandwidth efficiently, so YARN allows an application to specify locality constraints for the containers it is requesting. Locality constraints can be used to request a container on a specific node or rack, or anywhere on the cluster (off-rack).
Sometimes the locality constraint cannot be met, in which case either no allocation is made or, optionally, the constraint can be loosened. For example, if a specific node was requested but it is not possible to start a container on it (because other containers are running on it), then YARN will try to start a container on a node in the same rack, or, if that's not possible, on any node in the cluster. In the common case of launching a container to process an HDFS block (to run a map task in MapReduce, say), the application will request a container on one of the nodes hosting the block's three replicas, or on a node in one of the racks hosting the replicas, or, failing that, on any node in the cluster.
A YARN application can make resource requests at any time while it is running. For example, an application can make all of its requests up front, or it can take a more dynamic approach whereby it requests more resources dynamically to meet the changing needs of the application.

5) b) Differentiate between YARN and MapReduce.

In MapReduce 1, there are two types of daemon that control the job execution process: a jobtracker and one or more tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker. In MapReduce 1, the jobtracker takes care of both job scheduling and task progress monitoring. By contrast, in YARN these responsibilities are handled by separate entities: the resource manager and an application master (one for each MapReduce job). The jobtracker is also responsible for storing job history for completed jobs, although it is possible to run a job history server as a separate daemon to take the load off the jobtracker. In YARN, the equivalent role is the timeline server, which stores application history. The YARN equivalent of a tasktracker is a node manager.
YARN was designed to address many of the limitations in MapReduce 1. The benefits of using YARN include the following:

Scalability
YARN can run on larger clusters than MapReduce 1. MapReduce 1 hits scalability bottlenecks in the region of 4,000 nodes and 40,000 tasks, stemming from the fact that the jobtracker has to manage both jobs and tasks. YARN overcomes these limitations by virtue of its split resource manager/application master architecture: it is designed to scale up to 10,000 nodes and 100,000 tasks. In contrast to the jobtracker, each instance of an application (here, a MapReduce job) has a dedicated application master, which runs for the duration of the application. This model is actually closer to the original Google MapReduce paper, which describes how a master process is started to coordinate map and reduce tasks running on a set of workers.

Availability
High availability (HA) is usually achieved by replicating the state needed for another daemon to take over the work needed to provide the service, in the event of the service daemon failing. However, the large amount of rapidly changing complex state in the jobtracker's memory makes it very difficult to retrofit HA into the jobtracker service. With the jobtracker's responsibilities split between the resource manager and application master in YARN, making the service highly available became a divide-and-conquer problem: provide HA for the resource manager, then for YARN applications. And indeed, Hadoop 2 supports HA both for the resource manager and for the application master for MapReduce jobs.

Utilization
In MapReduce 1, each tasktracker is configured with a static allocation of fixed-size slots, which are divided into map slots and reduce slots at configuration time. A map slot can only be used to run a map task, and a reduce slot can only be used for a reduce task. In YARN, a node manager manages a pool of resources, rather than a fixed number of designated slots. MapReduce running on YARN will not hit the situation where a reduce task has to wait because only map slots are available on the cluster, which can happen in MapReduce 1. If the resources to run the task are available, then the application will be eligible for them. Furthermore, resources in YARN are fine grained, so an application can make a request for what it needs, rather than for an indivisible slot, which may be too big or too small for the particular task.

Multitenancy
In some ways, the biggest benefit of YARN is that it opens up Hadoop to other types of distributed application beyond MapReduce. MapReduce is just one YARN application among many. It is even possible for users to run different versions of MapReduce on the same YARN cluster, which makes the process of upgrading MapReduce more manageable.

UNIT-3

6) a) Explain the architecture of Pig and its components.

Parser:

Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer:
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.

Compiler:
The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine:
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and these MapReduce jobs are executed on Hadoop, producing the desired results.

6) b) Explain Pig Latin scripting for the word count program.

wordcount.pig:
lines = LOAD '/home/ubuntu/Desktop/wordcount.txt' AS (line: chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

Go to the terminal and run the script from the Grunt shell:
grunt> exec /home/ubuntu/Desktop/wordcount.pig
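Two common variations, not shown in the original answer (the paths are illustrative): the script can also be run directly in local mode without a cluster,

pig -x local /home/ubuntu/Desktop/wordcount.pig

and the final DUMP can be replaced by a STORE statement to write the counts to an output directory instead of the console:

STORE wordcount INTO '/home/ubuntu/Desktop/wordcount_out';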

7) Explain Hive working and Hive architecture in detail.

Hive working:

1) Execute Query: The Hive interface, such as the command line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2) Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan, or the requirement of the query.
3) Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4) Send Metadata: The Metastore sends the metadata as a response to the compiler.
5) Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6) Execute Plan:

The driver sends the execute plan to the execution engine.
7) Execute Job: Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the name node, and it assigns this job to the TaskTracker, which is in the data node. Here, the query executes the MapReduce job.
7.1) Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8) Fetch Result: The execution engine receives the results from the data nodes.
9) Send Results: The execution engine sends those resultant values to the driver.
10) Send Results: The driver sends the results to the Hive interfaces.

Hive Architecture:

User Interface: Hive is a data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).

Meta Store:

Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying the schema information in the Metastore. It is one of the replacements of the traditional approach of writing a MapReduce program. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.

Execution Engine: The conjunction part of the HiveQL process engine and MapReduce is the Hive execution engine. The execution engine processes the query and generates the same results as MapReduce. It uses the flavor of MapReduce.

HDFS or HBase: The Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.
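To make the HiveQL process engine concrete, a small hedged example (the table, columns and file path are assumptions, not from the original answer) of the kind of query that gets translated into a MapReduce job is:

CREATE TABLE logs (ip STRING, url STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/hadoop/bec/logs.tsv' INTO TABLE logs;

SELECT url, SUM(hits) AS total_hits
FROM logs
GROUP BY url;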

UNIT-4

8) Explain how a job runs in Spark.

At the highest level, there are two independent entities: the driver, which hosts the application (SparkContext) and schedules tasks for a job; and the executors, which are exclusive to the application, run for the duration of the application, and execute the application's tasks. Usually the driver runs as a client that is not managed by the cluster manager, and the executors run on machines in the cluster.

Job Submission:
A Spark job is submitted automatically when an action (such as count()) is performed on an RDD. Internally, this causes runJob() to be called on the SparkContext (step 1), which passes the call on to the scheduler that runs as a part of the driver (step 2). The scheduler is made up of two parts: a DAG scheduler that breaks down the job into a DAG of stages, and a task scheduler that is responsible for submitting the tasks from each stage to the cluster.

DAG Construction:
To understand how a job is broken up into stages, we need to look at the types of tasks that can run in a stage. There are two types: shuffle map tasks and result tasks. The name of the task type indicates what Spark does with the task's output:

Shuffle map tasks: As the name suggests, shuffle map tasks are like the map-side part of the shuffle in MapReduce. Each shuffle map task runs a computation on one RDD partition and, based on a partitioning function, writes its output to a new set of partitions, which are then fetched in a later stage (which could be composed of either shuffle map tasks or result tasks). Shuffle map tasks run in all stages except the final stage.

Result tasks: Result tasks run in the final stage that returns the result to the user's program (such as the result of a count()). Each result task runs a computation on its RDD partition, then sends the result back to the driver, and the driver assembles the results from each partition into a final result (which may be Unit, in the case of actions like saveAsTextFile()).

The simplest Spark job is one that does not need a shuffle and therefore has just a single stage composed of result tasks. This is like a map-only job in MapReduce.

Task Scheduling:
When the task scheduler is sent a set of tasks, it uses its list of executors that are running for the application and constructs a mapping of tasks to executors that takes placement preferences into account. Next, the task scheduler assigns tasks to executors that have free cores (this may not be the complete set if another job in the same application is running), and it continues to assign more tasks as executors finish running tasks, until the task set is complete. Each task is allocated one core by default, although this can be changed by setting spark.task.cpus.

Task Execution:
An executor runs a task as follows (step 7). First, it makes sure that the JAR and file dependencies for the task are up to date. The executor keeps a local cache of all the dependencies that previous tasks have used, so that it only downloads them when they have changed. Second, it deserializes the task code (which includes the user's functions) from the serialized bytes that were sent as a part of the launch task message. Third, the task code is executed. Note that tasks are run in the same JVM as the executor, so there is no process overhead for task launch. Tasks can return a result to the driver. The result is serialized and sent to the executor backend, and then back to the driver as a status update message. A shuffle map task returns information that allows the next stage to retrieve the output partitions, while a result task returns the value of the result for the partition it ran on, which the driver assembles into a final result to return to the user's program.
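A minimal sketch, using Spark's Java API, of an application whose single action triggers the job flow described above (the class name and input path are assumptions, not part of the original answer):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LineCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Line Count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Transformations are lazy: no job runs yet.
        JavaRDD<String> lines = sc.textFile("/hadoop/bec/wordcount.txt");
        JavaRDD<String> nonEmpty = lines.filter(line -> !line.isEmpty());

        // The count() action calls runJob() on the SparkContext; since there is
        // no shuffle, the job has a single stage made up of result tasks.
        long n = nonEmpty.count();
        System.out.println("Non-empty lines: " + n);

        sc.stop();
    }
}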

9) a) Explain in detail the Sqoop import and export statements.
9) b) Explain the functionality of the import and export functions in Sqoop with neat diagrams.

The import tool imports individual tables from an RDBMS into HDFS. Each row in a table is treated as a record in HDFS. All records are stored as text data in text files, or as binary data in Avro and Sequence files.

Syntax
The following syntax is used to import data into HDFS:
$ sqoop import (generic-args) (import-args)

Importing a Table
The Sqoop import tool is used to import table data from a table into the Hadoop file system as a text file or a binary file.

Import an entire table:
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table cities

Import a subset of data:

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table cities \
  --where "country = 'USA'"

IMPORT-ALL-TABLES
The following syntax is used to import all tables:
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)

The following command is used to import all the tables from the userdb database:
$ sqoop import-all-tables \
  --connect jdbc:mysql://localhost/userdb \
  --username root

The following command is used to verify that all the table data has been imported into HDFS:
$ $HADOOP_HOME/bin/hadoop fs -ls

Sqoop Export
The default operation is to insert all the records from the input files into the database table using the INSERT statement.

In update mode, Sqoop generates UPDATE statements that replace the existing records in the database.

The following is the syntax for the export command:
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)

The following command is used to export the table data (which is in the emp_data file on HDFS) to the employee table in the db database of the MySQL database server:
$ sqoop export \
  --connect jdbc:mysql://localhost/db \
  --username root \
  --table employee \
  --export-dir /emp/emp_data
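For the update mode mentioned above, a hedged sketch (assuming the employee table has an id primary-key column, which is not stated in the original answer) would add the --update-key option:

$ sqoop export \
  --connect jdbc:mysql://localhost/db \
  --username root \
  --table employee \
  --export-dir /emp/emp_data \
  --update-key id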


More information

Big Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka

Big Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka Course Curriculum: Your 10 Module Learning Plan Big Data and Hadoop About Edureka Edureka is a leading e-learning platform providing live instructor-led interactive online training. We cater to professionals

More information

3 Hadoop Installation: Pseudo-distributed mode

3 Hadoop Installation: Pseudo-distributed mode Laboratory 3 Hadoop Installation: Pseudo-distributed mode Obecjective Hadoop can be run in 3 different modes. Different modes of Hadoop are 1. Standalone Mode Default mode of Hadoop HDFS is not utilized

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014

Parallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014 Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example

More information

Actual4Dumps. Provide you with the latest actual exam dumps, and help you succeed

Actual4Dumps.   Provide you with the latest actual exam dumps, and help you succeed Actual4Dumps http://www.actual4dumps.com Provide you with the latest actual exam dumps, and help you succeed Exam : HDPCD Title : Hortonworks Data Platform Certified Developer Vendor : Hortonworks Version

More information

An Introduction to Big Data Analysis using Spark

An Introduction to Big Data Analysis using Spark An Introduction to Big Data Analysis using Spark Mohamad Jaber American University of Beirut - Faculty of Arts & Sciences - Department of Computer Science May 17, 2017 Mohamad Jaber (AUB) Spark May 17,

More information

UNIT-IV HDFS. Ms. Selva Mary. G

UNIT-IV HDFS. Ms. Selva Mary. G UNIT-IV HDFS HDFS ARCHITECTURE Dataset partition across a number of separate machines Hadoop Distributed File system The Design of HDFS HDFS is a file system designed for storing very large files with

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

Vendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo

Vendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo Vendor: Cloudera Exam Code: CCD-410 Exam Name: Cloudera Certified Developer for Apache Hadoop Version: Demo QUESTION 1 When is the earliest point at which the reduce method of a given Reducer can be called?

More information

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Yojna Arora, Dinesh Goyal Abstract: Big Data refers to that huge amount of data which cannot be analyzed by using traditional analytics

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Techno Expert Solutions An institute for specialized studies!

Techno Expert Solutions An institute for specialized studies! Course Content of Big Data Hadoop( Intermediate+ Advance) Pre-requistes: knowledge of Core Java/ Oracle: Basic of Unix S.no Topics Date Status Introduction to Big Data & Hadoop Importance of Data& Data

More information

Data Analytics Job Guarantee Program

Data Analytics Job Guarantee Program Data Analytics Job Guarantee Program 1. INSTALLATION OF VMWARE 2. MYSQL DATABASE 3. CORE JAVA 1.1 Types of Variable 1.2 Types of Datatype 1.3 Types of Modifiers 1.4 Types of constructors 1.5 Introduction

More information

Installation of Hadoop on Ubuntu

Installation of Hadoop on Ubuntu Installation of Hadoop on Ubuntu Various software and settings are required for Hadoop. This section is mainly developed based on rsqrl.com tutorial. 1- Install Java Software Java Version* Openjdk version

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce

More information

Microsoft Big Data and Hadoop

Microsoft Big Data and Hadoop Microsoft Big Data and Hadoop Lara Rubbelke @sqlgal Cindy Gross @sqlcindy 2 The world of data is changing The 4Vs of Big Data http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data 3 Common

More information

Guidelines For Hadoop and Spark Cluster Usage

Guidelines For Hadoop and Spark Cluster Usage Guidelines For Hadoop and Spark Cluster Usage Procedure to create an account in CSX. If you are taking a CS prefix course, you already have an account; to get an initial password created: 1. Login to https://cs.okstate.edu/pwreset

More information

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2

Big Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2 Big Data Analysis using Hadoop Map-Reduce An Introduction Lecture 2 Last Week - Recap 1 In this class Examine the Map-Reduce Framework What work each of the MR stages does Mapper Shuffle and Sort Reducer

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File

More information

Hadoop Quickstart. Table of contents

Hadoop Quickstart. Table of contents Table of contents 1 Purpose...2 2 Pre-requisites...2 2.1 Supported Platforms... 2 2.2 Required Software... 2 2.3 Installing Software...2 3 Download...2 4 Prepare to Start the Hadoop Cluster...3 5 Standalone

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

If you have ever appeared for the Hadoop interview, you must have experienced many Hadoop scenario based interview questions.

If you have ever appeared for the Hadoop interview, you must have experienced many Hadoop scenario based interview questions. Scenario Based Hadoop Interview Questions & Answers [Mega List] If you have ever appeared for the Hadoop interview, you must have experienced many Hadoop scenario based interview questions. Here I have

More information

UNIT II HADOOP FRAMEWORK

UNIT II HADOOP FRAMEWORK UNIT II HADOOP FRAMEWORK Hadoop Hadoop is an Apache open source framework written in java that allows distributed processing of large datasets across clusters of computers using simple programming models.

More information

Chase Wu New Jersey Institute of Technology

Chase Wu New Jersey Institute of Technology CS 644: Introduction to Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Institute of Technology Some of the slides were provided through the courtesy of Dr. Ching-Yung Lin at Columbia

More information

Innovatus Technologies

Innovatus Technologies HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String

More information