Quick Understand How To Develop An End-to-End E-commerce Application with Hadoop & Spark


World Journal of Technology, Engineering and Research, Volume 3, Issue 1 (2018) 8-17
Contents available at WJTER, World Journal of Technology, Engineering and Research
Journal Homepage:

Quick Understand How To Develop An End-to-End E-commerce Application with Hadoop & Spark

Tiruveedula GopiKrishna a*
a Department of Computer Science and Engineering, School of Electrical Engineering and Computing, Adama Science and Technology University, Ethiopia

Keywords: Hadoop, MapReduce, Pig, Sqoop, Hive, MySQL, Spark, Oozie, Scala

ABSTRACT
Nowadays, big data analytics has widespread applications in nearly every industry. The main success areas of analytics are e-commerce, revenue growth, a larger customer base, more accurate sales forecasts, product optimization, risk management, and improved customer segmentation. Here I demonstrate the step-by-step execution flow of one end-to-end e-commerce application using Hadoop components.

WJTER. All rights reserved.

1. INTRODUCTION
Since 2012, big data has promised to become ever more widely used, as organizations both small and large employ big data analytics to create a competitive advantage. Big data is defined as data that exceeds the processing capacity of conventional database management systems because of its volume, velocity, and variability. Within this data lie valuable patterns and information that previously required a great amount of work and cost to extract [1].

Hadoop as a big data processing technology has been around for 10 years and has proven to be the solution of choice for processing large data sets. MapReduce is a great solution for one-pass computations, but not very efficient for use cases that require multi-pass computations and algorithms. Each step in the data processing workflow has one Map phase and one Reduce phase, and you need to convert any use case into the MapReduce pattern to leverage this solution [1]. The job output data between each step has to be stored in the distributed file system before the next step can begin, so this approach tends to be slow due to replication and disk storage. Also, Hadoop solutions typically include clusters that are hard to set up and manage, and they require the integration of several tools for different big data use cases (like Mahout for machine learning and Storm for streaming data processing) [1]. If you wanted to do something complicated, you would have to string together a series of MapReduce jobs and execute them in sequence; each of those jobs was high-latency, and none could start until the previous job had finished completely [1].

Spark allows programmers to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data [1]. Spark runs on top of existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and additional functionality. It supports deploying Spark applications in an existing Hadoop v1 cluster (with SIMR, Spark-Inside-MapReduce), in a Hadoop v2 YARN cluster, or even on Apache Mesos [1]. We should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop: it is not intended to replace Hadoop but to provide a comprehensive and unified solution to manage different big data use cases and requirements [1].

2. PROJECT - DEPLOYMENT GUIDE
Step 1: Configure the source and target HDFS paths in the param.properties file.
Step 2: Check the script in which the log path and the data validation report are captured automatically.
Script name: CopyToHdfs.sh
Script contents (nano CopyToHdfs.sh):

. /home/gopalkrishna/install/oozie-4.2.0/projectnew/apps/map-reduce/parameter
timestamp=$(date +"%Y-%m-%d-%S")
hadoop dfsadmin -safemode leave
hdfs dfs -rm -r projectnew
hdfs dfs -put /home/gopalkrishna/install/oozie-4.2.0/projectnew/ projectnew
echo " "
echo "OOZIE time based scheduling configuration loaded successfully in HDFS"
cat $sourcepath/*.log | wc -l >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
hadoop fs -ls -R $dirpath >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
hadoop fs -cat $dirpath/input-data/*.log | wc -l >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
hadoop fs -du $dirpath >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
hadoop fsck -blocks /user/gopalkrishna/$dirpath >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
hadoop fs -count $dirpath >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
hadoop fs -stat $dirpath/* >> /home/gopalkrishna/oozie_project_logs/hdfslog_$timestamp
echo "Data Loading Done on HDFS Successfully"
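CopyToHdfs.sh sources the parameter file configured in Step 1 to pick up the $sourcepath and $dirpath variables it uses above. The paper does not reproduce that file; a minimal sketch, assuming only the two variables the script actually reads (the values shown are illustrative, not the author's):

# Hypothetical contents of the sourced parameter file (param.properties); values are illustrative
sourcepath=/home/gopalkrishna/projectnew/input-data   # local directory holding the input .log files
dirpath=projectnew                                    # HDFS project directory inspected by the hadoop fs commands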

Step 3: Open the script and check the log path with the current timestamp.
Step 4: Check the configuration file directory for the Oozie configuration. The main configuration files are job.properties, from which the job is initiated, and workflow.xml, a collection of <action> tags in which each action configures the details of one task.

Fig. 1: Checking the configuration file directory for the main Oozie configuration files

Step 5: Edit the job.properties file according to our cluster details and our job time intervals (nano job.properties):

namenode=hdfs://localhost:8020
resourcemanager=localhost:8032
queuename=default
examplesroot=projectnew
outputdir=custpartout
oozie.use.system.libpath=true
oozie.wf.application.path=${namenode}/user/${user.name}/${examplesroot}/apps/map-reduce/workflow.xml
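The job.properties above launches the workflow directly through oozie.wf.application.path. For the time-based scheduling mentioned in Step 2 and in Section 8 below, Oozie uses a coordinator definition instead, pointed to by oozie.coord.application.path. The paper does not show its coordinator.xml; a minimal sketch, with an assumed hourly frequency and illustrative start and end dates, could look like this:

<!-- Hypothetical coordinator.xml: runs the workflow every hour; frequency and dates are illustrative -->
<coordinator-app name="ecommerce-coord" frequency="${coord:hours(1)}"
                 start="2018-01-01T00:00Z" end="2018-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${namenode}/user/${user.name}/${examplesroot}/apps/map-reduce</app-path>
    </workflow>
  </action>
</coordinator-app>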

The workflow.xml:

<workflow-app xmlns="uri:oozie:workflow:0.1" name="map-reduce-wf">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <prepare>
        <delete path="${namenode}/user/${wf:user()}/${examplesroot}/output-data/${outputdir}"/>
      </prepare>
      <configuration>
        <property><name>mapred.mapper.new-api</name><value>true</value></property>
        <property><name>mapred.reducer.new-api</name><value>true</value></property>
        <property><name>mapred.job.queue.name</name><value>${queuename}</value></property>
        <property><name>mapreduce.map.class</name><value>com.mapred.custpart.emppartition_mapper</value></property>
        <property><name>mapreduce.reduce.class</name><value>com.mapred.custpart.emppartition_reducer</value></property>
        <property><name>mapred.output.key.class</name><value>org.apache.hadoop.io.Text</value></property>
        <property><name>mapred.output.value.class</name><value>org.apache.hadoop.io.Text</value></property>
        <property><name>mapreduce.partitioner.class</name><value>com.mapred.custpart.emppartitioner</value></property>
        <property><name>mapred.reduce.tasks</name><value>4</value></property>
        <property><name>mapred.input.dir</name><value>/user/${wf:user()}/${examplesroot}/input-data/*.log</value></property>
        <property><name>mapred.output.dir</name><value>/user/${wf:user()}/${examplesroot}/output-data/${outputdir}</value></property>
      </configuration>
    </map-reduce>
    <ok to="pig-node"/>
    <error to="fail-mr"/>
  </action>
  <action name="pig-node">
    <pig>
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <!--<prepare>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/xmloutput"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput1"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput2"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput3"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput4"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput1"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput2"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput3"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput4"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/CLOUDOUTPUT"/>

        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/FSIOUTPUT"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/MFGOUTPUT"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/pig/OTHEROUTPUT"/>
      </prepare>-->
      <configuration>
        <property><name>mapred.job.queue.name</name><value>${queuename}</value></property>
        <property><name>mapred.compress.map.output</name><value>true</value></property>
      </configuration>
      <script>pigscript.pig</script>
      <param>input=/user/${wf:user()}/${examplesroot}/input-data/cuinput.xml</param>
      <param>input1=/user/${wf:user()}/${examplesroot}/output-data/${outputdir}/part-r-00000</param>
      <param>input2=/user/${wf:user()}/${examplesroot}/output-data/${outputdir}/part-r-00001</param>
      <param>input3=/user/${wf:user()}/${examplesroot}/output-data/${outputdir}/part-r-00002</param>
      <param>input4=/user/${wf:user()}/${examplesroot}/output-data/${outputdir}/part-r-00003</param>
      <param>output=/user/${wf:user()}/${examplesroot}/output-data/pig/xmloutput</param>
      <param>output1=/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput1</param>
      <param>output2=/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput2</param>
      <param>output3=/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput3</param>
      <param>output4=/user/${wf:user()}/${examplesroot}/output-data/pig/mroutput4</param>
      <param>output5=/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput1</param>
      <param>output6=/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput2</param>
      <param>output7=/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput3</param>
      <param>output8=/user/${wf:user()}/${examplesroot}/output-data/pig/joinoutput4</param>
      <param>outputcloud=/user/${wf:user()}/${examplesroot}/output-data/pig/CLOUDOUTPUT</param>
      <param>outputfsi=/user/${wf:user()}/${examplesroot}/output-data/pig/FSIOUTPUT</param>
      <param>outputmfg=/user/${wf:user()}/${examplesroot}/output-data/pig/MFGOUTPUT</param>
      <param>outputother=/user/${wf:user()}/${examplesroot}/output-data/pig/OTHEROUTPUT</param>
    </pig>
    <ok to="sqoopactioncloud"/>
    <error to="fail-pig"/>
  </action>
  <action name="sqoopactioncloud">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <command>export --connect jdbc:mysql://localhost/projectnew --username root --password root --table cloud --export-dir /user/gopalkrishna/projectnew/output-data/pig/CLOUDOUTPUT/part-r-00000 -m 1</command>
    </sqoop>
    <ok to="sqoopactionfsi"/>
    <error to="fail-sqoopcloud"/>
  </action>

  <action name="sqoopactionfsi">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <command>export --connect jdbc:mysql://localhost/projectnew --username root --password root --table fsi --export-dir /user/gopalkrishna/projectnew/output-data/pig/FSIOUTPUT/part-r-00000 -m 1</command>
    </sqoop>
    <ok to="sqoopactionmfg"/>
    <error to="fail-sqoopfsi"/>
  </action>
  <action name="sqoopactionmfg">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <command>export --connect jdbc:mysql://localhost/projectnew --username root --password root --table mfg --export-dir /user/gopalkrishna/projectnew/output-data/pig/MFGOUTPUT/part-r-00000 -m 1</command>
    </sqoop>
    <ok to="sqoopactionother"/>
    <error to="fail-sqoopmfg"/>
  </action>
  <action name="sqoopactionother">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <command>export --connect jdbc:mysql://localhost/projectnew --username root --password root --table other --export-dir /user/gopalkrishna/projectnew/output-data/pig/OTHEROUTPUT/part-r-00000 -m 1</command>
    </sqoop>
    <ok to="hive-node"/>
    <error to="fail-sqoopother"/>
  </action>
  <action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${resourcemanager}</job-tracker>
      <name-node>${namenode}</name-node>
      <!--<prepare>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/cloud"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/fsi"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/mfg"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/other"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabcloud"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabfsi"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabmfg"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabother"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountcloud"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountfsi"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountmfg"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountother"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountcloud"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountfsi"/>
        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountmfg"/>

        <delete path="/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountother"/>
        <mkdir path="/user/${wf:user()}/${examplesroot}/output-data/hive"/>
      </prepare>-->
      <configuration>
        <property><name>mapred.job.queue.name</name><value>${queuename}</value></property>
      </configuration>
      <script>hivescript.hql</script>
      <param>hinput1=/user/${wf:user()}/${examplesroot}/output-data/pig/CLOUDOUTPUT</param>
      <param>hinput2=/user/${wf:user()}/${examplesroot}/output-data/pig/FSIOUTPUT</param>
      <param>hinput3=/user/${wf:user()}/${examplesroot}/output-data/pig/MFGOUTPUT</param>
      <param>hinput4=/user/${wf:user()}/${examplesroot}/output-data/pig/OTHEROUTPUT</param>
      <param>houtput1=/user/${wf:user()}/${examplesroot}/output-data/hive/cloud</param>
      <param>houtput2=/user/${wf:user()}/${examplesroot}/output-data/hive/fsi</param>
      <param>houtput3=/user/${wf:user()}/${examplesroot}/output-data/hive/mfg</param>
      <param>houtput4=/user/${wf:user()}/${examplesroot}/output-data/hive/other</param>
      <param>houtput5=/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabcloud</param>
      <param>houtput6=/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabfsi</param>
      <param>houtput7=/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabmfg</param>
      <param>houtput8=/user/${wf:user()}/${examplesroot}/output-data/hive/partbuckettabother</param>
      <param>houtput9=/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountcloud</param>
      <param>houtput10=/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountfsi</param>
      <param>houtput11=/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountmfg</param>
      <param>houtput12=/user/${wf:user()}/${examplesroot}/output-data/hive/yearcountother</param>
      <param>houtput13=/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountcloud</param>
      <param>houtput14=/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountfsi</param>
      <param>houtput15=/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountmfg</param>
      <param>houtput16=/user/${wf:user()}/${examplesroot}/output-data/hive/gdcountother</param>
    </hive>
    <ok to="end"/>
    <error to="fail-hive"/>
  </action>
  <kill name="fail-mr">
    <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <kill name="fail-pig">
    <message>Pig failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <kill name="fail-sqoopcloud">
    <message>Sqoop HDM failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <kill name="fail-sqoopfsi">
    <message>Sqoop WP failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <kill name="fail-sqoopmfg">
    <message>Sqoop OGMS failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <kill name="fail-sqoopother">
    <message>Sqoop RP failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <kill name="fail-hive">
    <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
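The mr-node above configures four reduce tasks and the custom partitioner class com.mapred.custpart.emppartitioner, so that each department's records land in their own part-r file. The paper does not include its source; a minimal Java sketch of such a department-wise partitioner, assuming the mapper emits the department name as the key (class and package names follow the workflow, with casing assumed):

package com.mapred.custpart;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical reconstruction of the department-wise partitioner the workflow references.
public class EmpPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0; // degenerate case: no reducers configured
        }
        switch (key.toString().trim().toUpperCase()) {
            case "CLOUD": return 0; // part-r-00000
            case "FSI":   return 1; // part-r-00001
            case "MFG":   return 2; // part-r-00002
            default:      return 3; // part-r-00003 holds OTHER (RTL, ETM, ...)
        }
    }
}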

Step 6: Start the Hadoop daemons.
Execution steps: start the Hadoop daemons, the Oozie daemon, and the Job History Server (the Job History Server is one of the MapReduce daemons). Web URIs are available for the history server, the resource manager, the namenode, and Oozie.
Start or stop the Oozie server:

$OOZIE_HOME/bin/oozie-start.sh - to start Oozie
$OOZIE_HOME/bin/oozie-stop.sh - to stop Oozie

Run the jps command to verify that all daemons are up.
Step 7: Run the Oozie job.
Step 8: Check the web URL for Oozie.

3. FUNCTIONAL REQUIREMENTS
Input source (file name: CustInputData.log)
Input data format (.csv): {ID, DEPTNAME, GENDER, AGE, JOININGDATE}
Sample input data (CSV):

1000,CLOUD,MALE,25,
,FSI,FEMALE,28,
,MFG,MALE,34,
,CLOUD,FEMALE,36,
,FSI,MALE,39,
,MFG,MALE,45,
,RTL,MALE,44,
,ETM,MALE,43,
,CLOUD,MALE,25,

Take the complete input data into HDFS. Develop a MapReduce use case that takes the HDFS input data and partitions it by DEPTNAME (key and value separated by '\t'):

part-r-00000: cloud 10000,male,25,
part-r-00001: fsi 1001,female,25,
part-r-00002: mfg 10000,male,25,
part-r-00003: other 1001,female,25,

Develop a Pig script to filter the MapReduce output in the below fashion (a sketch of such a script follows these requirements):
- Load the XML data.
- Join the XML data and the MapReduce output data.
- Provide the unique data.
- Filter the unique data based on age > 25.
- Sort the unique data based on ID.
Export the same Pig output from HDFS to MySQL using Sqoop.
Create Hive external tables and load the Pig-processed output.
Create Hive external tables partitioned by gender and clustered by ID into 4 buckets.
Generate various analysis reports through Hive.
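The paper names pigscript.pig but does not list it. A minimal sketch of the dedupe, filter, and sort steps above, taking one department's MapReduce output and writing one Pig output directory ($input1 and $output1 follow the workflow's <param> names; the comma-separated field layout is an assumption based on the sample data, and the XML-join step that uses the $input parameter is omitted):

-- Hypothetical sketch of pigscript.pig (field layout assumed; XML join omitted)
raw    = LOAD '$input1' USING PigStorage(',') AS (id:int, gender:chararray, age:int, joindate:chararray);
uniq   = DISTINCT raw;              -- provide the unique data
adults = FILTER uniq BY age > 25;   -- filter on age > 25
byid   = ORDER adults BY id ASC;    -- sort on ID
STORE byid INTO '$output1' USING PigStorage(',');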

Fig. 2: Solution Architecture of the E-Commerce Project
Fig. 3: Application Flow of the E-Commerce Project

4. DETAILED FLOW OF THE PROJECT
The client-provided input data needs to be loaded into HDFS; for that we use CopyToHdfs.sh. To analyze the data and retrieve the value out of it, I use MapReduce processing. The high-level steps involved in MapReduce processing are:
- a Mapper class for the transformation phase,
- Reducer logic for the business computations,
- custom Partitioner logic to get department-wise data (sketched after the workflow above), and
- an output format to hold the output.

5. TO ELIMINATE THE DUPLICATED VALUES (IF ANY)
Duplicates are eliminated based on ID; for that we use the Pig component on the MapReduce output, which also sorts the data.

6. SAME PIG OUTPUT
The same Pig output we load into Hive external tables (where the data persists on HDFS even if the table is dropped), so that we can generate the ad-hoc query reports as per customer requirements.
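hivescript.hql is likewise only named in the paper. A minimal sketch of the external tables that Sections 3 and 6 describe, one plain external table over a Pig output directory and one table partitioned by gender and clustered by id into 4 buckets (${hinput1} and ${houtput5} follow the workflow's <param> names; table and column names are assumptions):

-- Hypothetical sketch of hivescript.hql (table and column names assumed)
CREATE EXTERNAL TABLE IF NOT EXISTS cloud_cust (id INT, gender STRING, age INT, joindate STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hinput1}';

SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE EXTERNAL TABLE IF NOT EXISTS partbuckettabcloud (id INT, age INT, joindate STRING)
PARTITIONED BY (gender STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${houtput5}';

INSERT OVERWRITE TABLE partbuckettabcloud PARTITION (gender)
SELECT id, age, joindate, gender FROM cloud_cust;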

7. SCHEDULING OF ALL THE HADOOP JOBS THROUGH OOZIE
To send the processed Pig output to the dashboard solution, we export the same data to an external RDBMS using the Sqoop utility of Hadoop.

8. OOZIE JOB SCHEDULING
To schedule all these Hadoop jobs, we configure the Oozie workflow.xml, where each individual task's details are specified in an <action> tag. As part of Oozie, the core building blocks are:
- job.properties: the high-level job parameters, and the place from which the job initiation is done
- workflow.xml: a collection of all action nodes, where one action is one task
- co-ordinator.xml: to schedule the workflow.xml on a timely basis (hourly, daily, monthly, etc.)

9. DEPLOYMENT STEPS
Scripts involved in the deployment:
- parameter.sh
- CopyToHdfs.sh
- the execution script for the Oozie job (oozie-run.sh)
- pigscript.pig
- hivescript.hql

10. PREREQUISITES FOR DEPLOYMENT
- All Hadoop daemons, including Oozie, should be up and running (use jps to check).
- Copy all script files into one directory.
- Check the Hadoop version (hadoop version).
- Check the Pig version (version 0.13 onwards is required).
- Check the Hive version (version 0.12 onwards is required).

11. RESEARCH CHALLENGES FOR BIG DATA
1. Conducting quantitative research and surveys, using the Hadoop components, in the area of big data analytics.
2. Determining the theories which can be mobilized for studying big data in analytics.
3. Developing metrics to measure the performance of any kind of data analytics in a big data setting.
4. Determining the sequence of intermediate mechanisms between big data and supply chain performance.
5. Determining the way of integrating SCM initiatives into big data analytics programs.
6. Studying the impact of big data on the external supply chain.

12. CONCLUSION
Today, Spark is being adopted by major players like Amazon, eBay, and Yahoo!, and many organizations run Spark on clusters with thousands of nodes; according to the Spark FAQ, the largest known cluster has over 8000 nodes. Indeed, Spark is a technology well worth taking note of and learning about, and I highly recommend it to any aspiring Spark developer looking for a place to get started. It is one of the developing trends in big data analytics, including e-commerce projects and young data-analysis projects. This paper shows how to process any big data log file using MapReduce and how Hadoop components are used for parallel computation of big data files. It demonstrates that processing big data with the help of Hadoop components leads to minimal computation and response time. I also speculate on what the future holds for big data analysis and the Hadoop ecosystem, a future Hadoop which may also include Apache Spark and others.

References
[1]
