Quick Understand How To Develop An End-to-End E-commerce Application with Hadoop & Spark


World Journal of Technology, Engineering and Research, Volume 3, Issue 1 (2018) 8-17
Contents available at WJTER, World Journal of Technology, Engineering and Research
Journal Homepage:

Quick Understand How To Develop An End-to-End E-commerce Application with Hadoop & Spark

Tiruveedula GopiKrishna a*
a Department of Computer Science and Engineering, School of Electrical Engineering and Computing, Adama Science and Technology University, Ethiopia

Keywords: Hadoop, MapReduce, Pig, Sqoop, Hive, MySQL, Spark, Oozie, Scala

ABSTRACT
Nowadays, big data analytics has widespread applications in nearly every industry. The main success areas of analytics are e-commerce, revenue growth, a larger customer base, more accurate sales forecasts, product optimization, risk management, and improved customer segmentation. Here I demonstrate the step-by-step execution flow of one end-to-end e-commerce application using Hadoop components.

WJTER. All rights reserved.

1. INTRODUCTION
Since 2012, big data has promised to become ever more widely used, as organizations both small and large employ big data analytics to create a competitive advantage. Big data is defined as data that exceeds the processing capacity of conventional database management systems because of its volume, velocity, and variability. Within this data lie valuable patterns and information that previously required a great amount of work and cost to extract [1].

Hadoop as a big data processing technology has been around for 10 years and has proven to be the solution of choice for processing large data sets. MapReduce is a great solution for one-pass computations, but not very efficient for use cases that require multi-pass computations and algorithms. Each step in the data processing workflow has one Map phase and one Reduce phase, and you need to convert any use case into the MapReduce pattern to leverage this solution [1]. The job output data between each step has to be stored in the distributed file system before the next step can begin, so this approach tends to be slow due to replication and disk storage. Also, Hadoop solutions typically include clusters that are hard to set up and manage, and they require the integration of several tools for different big data use cases (like Mahout for machine learning and Storm for streaming data processing) [1]. If you wanted to do something complicated, you would have to string together a series of MapReduce jobs and execute them in sequence; each of those jobs was high-latency, and none could start until the previous job had finished completely [1].

Spark allows programmers to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data [1]. Spark runs on top of existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and additional functionality. It supports deploying Spark applications in an existing Hadoop v1 cluster (with SIMR, Spark-Inside-MapReduce), in a Hadoop v2 YARN cluster, or even on Apache Mesos [1]. We should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop: it is not intended to replace Hadoop but to provide a comprehensive and unified solution to manage different big data use cases and requirements [1].

2. PROJECT - DEPLOYMENT GUIDE
Step 1: Configure the source and target HDFS paths in the param.properties file.
Step 2: Check the script in which the log path and the data validation report are captured automatically.
Script name: CopyToHdfs.sh
Script contents (nano CopyToHdfs.sh):

. /home/gopalkrishna/install/oozie-4.2.0/projectnew/apps/map-reduce/parameter
timestamp=$(date +"%Y-%m-%d-%S")
hadoop dfsadmin -safemode leave
hdfs dfs -rm -r projectnew
hdfs dfs -put /home/gopalkrishna/install/oozie-4.2.0/projectnew/ projectnew
echo " "
echo "OOZIE time based scheduling configuration loaded successfully in HDFS"
cat $sourcepath/*.log | wc -l >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
hadoop fs -ls -R $dirpath >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
hadoop fs -cat $dirpath/input-data/*.log | wc -l >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
hadoop fs -du $dirpath >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
hadoop fsck -blocks /user/gopalkrishna/$dirpath >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
hadoop fs -count $dirpath >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
hadoop fs -stat $dirpath/* >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
echo "Data Loading Done on HDFS Successfully"
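CopyToHdfs.sh sources the parameter file configured in Step 1 to pick up the $sourcepath and $dirpath variables it uses above. The paper does not reproduce that file; a minimal sketch, assuming only the two variables the script actually reads (the values shown are illustrative, not the author's):

# Hypothetical contents of the sourced parameter file (param.properties); values are illustrative
sourcepath=/home/gopalkrishna/projectnew/input-data   # local directory holding the input .log files
dirpath=projectnew                                    # HDFS project directory inspected by the hadoop fs commands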

Step 3: Open the script and check the log path with the current timestamp.
Step 4: Check the configuration file directory for the Oozie configuration. The main configuration files are job.properties, from which the job is initiated, and workflow.xml, a collection of <action> tags in which each action configures the details of one task.

Fig. 1: Checking the configuration file directory for the main Oozie configuration files

Step 5: Edit the job.properties file according to our cluster details and our job time intervals (nano job.properties):

namenode=hdfs://localhost:8020
resourcemanager=localhost:8032
queuename=default
examplesroot=projectnew
outputdir=custpartout
oozie.use.system.libpath=true
oozie.wf.application.path=${namenode}/user/${user.name}/${examplesroot}/apps/map-reduce/workflow.xml
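The job.properties above launches the workflow directly through oozie.wf.application.path. For the time-based scheduling mentioned in Step 2 and in Section 8 below, Oozie uses a coordinator definition instead, pointed to by oozie.coord.application.path. The paper does not show its coordinator.xml; a minimal sketch, with an assumed hourly frequency and illustrative start and end dates, could look like this:

<!-- Hypothetical coordinator.xml: runs the workflow every hour; frequency and dates are illustrative -->
<coordinator-app name="ecommerce-coord" frequency="${coord:hours(1)}"
                 start="2018-01-01T00:00Z" end="2018-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${namenode}/user/${user.name}/${examplesroot}/apps/map-reduce</app-path>
    </workflow>
  </action>
</coordinator-app>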

The workflow.xml:

<workflow-app xmlns="uri:oozie:workflow:0.1" name="map-reduce-wf">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <prepare>
        <delete path="${namenode}/user/${wf:user()}/${examplesroot}/output-data/${outputdir}"/>
      </prepare>
      <configuration>
        <property><name>mapred.mapper.new-api</name><value>true</value></property>
        <property><name>mapred.reducer.new-api</name><value>true</value></property>
        <property><name>mapred.job.queue.name</name><value>${queuename}</value></property>
        <property><name>mapreduce.map.class</name><value>com.mapred.custpart.emppartition_mapper</value></property>
        <property><name>mapreduce.reduce.class</name><value>com.mapred.custpart.emppartition_reducer</value></property>
        <property><name>mapred.output.key.class</name><value>org.apache.hadoop.io.Text</value></property>
        <property><name>mapred.output.value.class</name><value>org.apache.hadoop.io.Text</value></property>
        <property><name>mapreduce.partitioner.class</name><value>com.mapred.custpart.emppartitioner</value></property>
        <property><name>mapred.reduce.tasks</name><value>4</value></property>
        <property><name>mapred.input.dir</name><value>/user/${wf:user()}/${examplesroot}/input-data/*.log</value></property>
        <property><name>mapred.output.dir</name><value>/user/${wf:user()}/${examplesroot}/output-data/${outputdir}</value></property>
      </configuration>
    </map-reduce>
    <ok to="pig-node"/>
    <error to="fail-mr"/>
  </action>
  <action name="pig-node">
    <pig>
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <!--<prepare>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/xmloutput"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput1"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput2"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput3"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput4"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput1"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput2"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput3"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput4"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/CLOUDOUTPUT"/>

        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/FSIOUTPUT"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/MFGOUTPUT"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/OTHEROUTPUT"/>
      </prepare>-->
      <configuration>
        <property><name>mapred.job.queue.name</name><value>${queuename}</value></property>
        <property><name>mapred.compress.map.output</name><value>true</value></property>
      </configuration>
      <script>pigscript.pig</script>
      <param>input=/user/${wf:user()}/${examplesroot}/input-data/cuinput.xml</param>
      <param>input1=/user/${wf:user()}/${examplesroot}/output-data/${outputdir}/part-r-00000</param>
      <param>input2=/user/${wf:user()}/${examplesroot}/output-data/${outputdir}/part-r-00001</param>
      <param>input3=/user/${wf:user()}/${examplesroot}/output-data/${outputdir}/part-r-00002</param>
      <param>input4=/user/${wf:user()}/${examplesroot}/output-data/${outputdir}/part-r-00003</param>
      <param>output=/user/${wf:user()}/${examplesroot}/output-data/pig/xmloutput</param>
      <param>output1=/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput1</param>
      <param>output2=/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput2</param>
      <param>output3=/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput3</param>
      <param>output4=/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput4</param>
      <param>output5=/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput1</param>
      <param>output6=/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput2</param>
      <param>output7=/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput3</param>
      <param>output8=/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput4</param>
      <param>outputcloud=/user/${wf:user()}/${examplesroot}/output-data/pig/CLOUDOUTPUT</param>
      <param>outputfsi=/user/${wf:user()}/${examplesroot}/output-data/pig/FSIOUTPUT</param>
      <param>outputmfg=/user/${wf:user()}/${examplesroot}/output-data/pig/MFGOUTPUT</param>
      <param>outputother=/user/${wf:user()}/${examplesroot}/output-data/pig/OTHEROUTPUT</param>
    </pig>
    <ok to="sqoopactioncloud"/>
    <error to="fail-pig"/>
  </action>
  <action name="sqoopactioncloud">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <command>export --connect jdbc:mysql://localhost/projectnew --username root --password root --table cloud --export-dir /user/gopalkrishna/projectnew/output-data/pig/CLOUDOUTPUT/part-r-00000 -m 1</command>
    </sqoop>
    <ok to="sqoopactionfsi"/>
    <error to="fail-sqoopcloud"/>
  </action>

  <action name="sqoopactionfsi">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <command>export --connect jdbc:mysql://localhost/projectnew --username root --password root --table fsi --export-dir /user/gopalkrishna/projectnew/output-data/pig/FSIOUTPUT/part-r-00000 -m 1</command>
    </sqoop>
    <ok to="sqoopactionmfg"/>
    <error to="fail-sqoopfsi"/>
  </action>
  <action name="sqoopactionmfg">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <command>export --connect jdbc:mysql://localhost/projectnew --username root --password root --table mfg --export-dir /user/gopalkrishna/projectnew/output-data/pig/MFGOUTPUT/part-r-00000 -m 1</command>
    </sqoop>
    <ok to="sqoopactionother"/>
    <error to="fail-sqoopmfg"/>
  </action>
  <action name="sqoopactionother">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <command>export --connect jdbc:mysql://localhost/projectnew --username root --password root --table other --export-dir /user/gopalkrishna/projectnew/output-data/pig/OTHEROUTPUT/part-r-00000 -m 1</command>
    </sqoop>
    <ok to="hive-node"/>
    <error to="fail-sqoopother"/>
  </action>
  <action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <!--<prepare>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/cloud"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/fsi"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/mfg"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/other"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabcloud"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabfsi"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabmfg"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabother"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountcloud"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountfsi"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountmfg"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountother"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountcloud"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountfsi"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountmfg"/>

        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountother"/>
        <mkdir path="/user/${wf:user()}/${examplesroot}/output-data/hive"/>
      </prepare>-->
      <configuration>
        <property><name>mapred.job.queue.name</name><value>${queuename}</value></property>
      </configuration>
      <script>hivescript.hql</script>
      <param>hinput1=/user/${wf:user()}/${examplesroot}/output-data/pig/CLOUDOUTPUT</param>
      <param>hinput2=/user/${wf:user()}/${examplesroot}/output-data/pig/FSIOUTPUT</param>
      <param>hinput3=/user/${wf:user()}/${examplesroot}/output-data/pig/MFGOUTPUT</param>
      <param>hinput4=/user/${wf:user()}/${examplesroot}/output-data/pig/OTHEROUTPUT</param>
      <param>houtput1=/user/${wf:user()}/${examplesroot}/output-data/hive/cloud</param>
      <param>houtput2=/user/${wf:user()}/${examplesroot}/output-data/hive/fsi</param>
      <param>houtput3=/user/${wf:user()}/${examplesroot}/output-data/hive/mfg</param>
      <param>houtput4=/user/${wf:user()}/${examplesroot}/output-data/hive/other</param>
      <param>houtput5=/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabcloud</param>
      <param>houtput6=/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabfsi</param>
      <param>houtput7=/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabmfg</param>
      <param>houtput8=/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabother</param>
      <param>houtput9=/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountcloud</param>
      <param>houtput10=/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountfsi</param>
      <param>houtput11=/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountmfg</param>
      <param>houtput12=/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountother</param>
      <param>houtput13=/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountcloud</param>
      <param>houtput14=/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountfsi</param>
      <param>houtput15=/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountmfg</param>
      <param>houtput16=/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountother</param>
    </hive>
    <ok to="end"/>
    <error to="fail-hive"/>
  </action>
  <kill name="fail-mr">
    <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <kill name="fail-pig">
    <message>Pig failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <kill name="fail-sqoopcloud">
    <message>Sqoop HDM failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <kill name="fail-sqoopfsi">
    <message>Sqoop WP failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <kill name="fail-sqoopmfg">
    <message>Sqoop OGMS failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <kill name="fail-sqoopother">
    <message>Sqoop RP failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <kill name="fail-hive">
    <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
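The mr-node above configures four reduce tasks and the custom partitioner class com.mapred.custpart.emppartitioner, so that each department's records land in their own part-r file. The paper does not include its source; a minimal Java sketch of such a department-wise partitioner, assuming the mapper emits the department name as the key (class and package names follow the workflow, with casing assumed):

package com.mapred.custpart;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical reconstruction of the department-wise partitioner the workflow references.
public class EmpPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0; // degenerate case: no reducers configured
        }
        switch (key.toString().trim().toUpperCase()) {
            case "CLOUD": return 0; // part-r-00000
            case "FSI":   return 1; // part-r-00001
            case "MFG":   return 2; // part-r-00002
            default:      return 3; // part-r-00003 holds OTHER (RTL, ETM, ...)
        }
    }
}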

Step 6: Start the Hadoop daemons.
Execution steps: start the Hadoop daemons, the Oozie daemon, and the Job History Server (the Job History Server is one of the MapReduce daemons). Web URIs are available for the history server, the resource manager, the namenode, and Oozie.
Start or stop the Oozie server:

$OOZIE_HOME/bin/oozie-start.sh - to start Oozie
$OOZIE_HOME/bin/oozie-stop.sh - to stop Oozie

Run the jps command to verify that all daemons are up.
Step 7: Run the Oozie job.
Step 8: Check the web URL for Oozie.

3. FUNCTIONAL REQUIREMENTS
Input source (file name: CustInputData.log)
Input data format (.csv): {ID, DEPTNAME, GENDER, AGE, JOININGDATE}
Sample input data (CSV):

1000,CLOUD,MALE,25,
,FSI,FEMALE,28,
,MFG,MALE,34,
,CLOUD,FEMALE,36,
,FSI,MALE,39,
,MFG,MALE,45,
,RTL,MALE,44,
,ETM,MALE,43,
,CLOUD,MALE,25,

Take the complete input data into HDFS. Develop a MapReduce use case that takes the HDFS input data and partitions it by DEPTNAME (key and value separated by '\t'):

part-r-00000: cloud 10000,male,25,
part-r-00001: fsi 1001,female,25,
part-r-00002: mfg 10000,male,25,
part-r-00003: other 1001,female,25,

Develop a Pig script to filter the MapReduce output in the below fashion (a sketch of such a script follows these requirements):
- Load the XML data.
- Join the XML data and the MapReduce output data.
- Provide the unique data.
- Filter the unique data based on age > 25.
- Sort the unique data based on ID.
Export the same Pig output from HDFS to MySQL using Sqoop.
Create Hive external tables and load the Pig-processed output.
Create Hive external tables partitioned by gender and clustered by ID into 4 buckets.
Generate various analysis reports through Hive.
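The paper names pigscript.pig but does not list it. A minimal sketch of the dedupe, filter, and sort steps above, taking one department's MapReduce output and writing one Pig output directory ($input1 and $output1 follow the workflow's <param> names; the comma-separated field layout is an assumption based on the sample data, and the XML-join step that uses the $input parameter is omitted):

-- Hypothetical sketch of pigscript.pig (field layout assumed; XML join omitted)
raw    = LOAD '$input1' USING PigStorage(',') AS (id:int, gender:chararray, age:int, joindate:chararray);
uniq   = DISTINCT raw;              -- provide the unique data
adults = FILTER uniq BY age > 25;   -- filter on age > 25
byid   = ORDER adults BY id ASC;    -- sort on ID
STORE byid INTO '$output1' USING PigStorage(',');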

Fig. 2: Solution Architecture of the E-Commerce Project
Fig. 3: Application Flow of the E-Commerce Project

4. DETAILED FLOW OF THE PROJECT
The client-provided input data needs to be loaded into HDFS; for that we use CopyToHdfs.sh. To analyze the data and retrieve the value out of it, I use MapReduce processing. The high-level steps involved in MapReduce processing are:
- a Mapper class for the transformation phase,
- Reducer logic for the business computations,
- custom Partitioner logic to get department-wise data (sketched after the workflow above), and
- an output format to hold the output.

5. TO ELIMINATE THE DUPLICATED VALUES (IF ANY)
Duplicates are eliminated based on ID; for that we use the Pig component on the MapReduce output, which also sorts the data.

6. SAME PIG OUTPUT
The same Pig output we load into Hive external tables (where the data persists on HDFS even if the table is dropped), so that we can generate the ad-hoc query reports as per customer requirements.
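hivescript.hql is likewise only named in the paper. A minimal sketch of the external tables that Sections 3 and 6 describe, one plain external table over a Pig output directory and one table partitioned by gender and clustered by id into 4 buckets (${hinput1} and ${houtput5} follow the workflow's <param> names; table and column names are assumptions):

-- Hypothetical sketch of hivescript.hql (table and column names assumed)
CREATE EXTERNAL TABLE IF NOT EXISTS cloud_cust (id INT, gender STRING, age INT, joindate STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hinput1}';

SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE EXTERNAL TABLE IF NOT EXISTS partbuckettabcloud (id INT, age INT, joindate STRING)
PARTITIONED BY (gender STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${houtput5}';

INSERT OVERWRITE TABLE partbuckettabcloud PARTITION (gender)
SELECT id, age, joindate, gender FROM cloud_cust;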

7. SCHEDULING OF ALL THE HADOOP JOBS THROUGH OOZIE
To send the processed Pig output to the dashboard solution, we export the same data to an external RDBMS using the Sqoop utility of Hadoop.

8. OOZIE JOB SCHEDULING
To schedule all these Hadoop jobs, we configure the Oozie workflow.xml, where each individual task's details are specified in an <action> tag. As part of Oozie, the core building blocks are:
- job.properties: the high-level job parameters, and the place from which the job initiation is done
- workflow.xml: a collection of all action nodes, where one action is one task
- co-ordinator.xml: to schedule the workflow.xml on a timely basis (hourly, daily, monthly, etc.)

9. DEPLOYMENT STEPS
Scripts involved in the deployment:
- parameter.sh
- CopyToHdfs.sh
- the execution script for the Oozie job (oozie-run.sh)
- pigscript.pig
- hivescript.hql

10. PREREQUISITES FOR DEPLOYMENT
- All Hadoop daemons, including Oozie, should be up and running (use jps to check).
- Copy all script files into one directory.
- Check the Hadoop version (hadoop version).
- Check the Pig version (version 0.13 onwards is required).
- Check the Hive version (version 0.12 onwards is required).

11. RESEARCH CHALLENGES FOR BIG DATA
1. Conducting quantitative research and surveys, using the Hadoop components, in the area of big data analytics.
2. Determining the theories which can be mobilized for studying big data in analytics.
3. Developing metrics to measure the performance of any kind of data analytics in a big data setting.
4. Determining the sequence of intermediate mechanisms between big data and supply chain performance.
5. Determining the way of integrating SCM initiatives into big data analytics programs.
6. Studying the impact of big data on the external supply chain.

12. CONCLUSION
Today, Spark is being adopted by major players like Amazon, eBay, and Yahoo!, and many organizations run Spark on clusters with thousands of nodes; according to the Spark FAQ, the largest known cluster has over 8000 nodes. Indeed, Spark is a technology well worth taking note of and learning about, and I highly recommend it to any aspiring Spark developer looking for a place to get started. It is one of the developing trends in big data analytics, including e-commerce projects and young data-analysis projects. This paper shows how to process any big data log file using MapReduce and how Hadoop components are used for parallel computation of big data files. It demonstrates that processing big data with the help of Hadoop components leads to minimal computation and response time. I also speculate on what the future holds for big data analysis and the Hadoop ecosystem, a future Hadoop which may also include Apache Spark and others.

References
[1]
