A New Model of Search Engine based on Cloud Computing


A New Model of Search Engine based on Cloud Computing

DING Jian-li 1,2, YANG Bo 1
1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin, China
2. Tianjin Key Lab for Advanced Signal Processing, Civil Aviation University of China, Tianjin, China
doi: /jdcta.vol5.issue6.28

Abstract

With the rapid increase in websites and internet users, traditional search engines face great challenges in real-time search, response speed, and the storage of massive numbers of pages. A search engine deployed in the cloud can overcome these shortcomings, because cloud computing offers two major advantages: mass data processing and mass data storage. By analyzing the open-source cloud computing system Hadoop, this paper constructs a search engine model on a cloud platform and optimizes the core algorithms of the search engine to improve its overall performance.

Keywords: Cloud Computing, Hadoop, Search Engine, Model, Algorithm Optimization

1. Introduction

In recent years, researchers have concentrated mainly on vertical search engines [1,2] and have achieved a great deal. However, these studies address only the application area of search engines. Meanwhile, with the rapid development of internet technology, combined with the spread of 3G networks, the online population and the number of web pages are growing quickly. The traditional architecture of search engines cannot keep up with this growth, and search engines now face two questions: how the mass data in the network can be stored, and how it can be processed fast. Cloud computing [3], with its two features of mass data storage and mass data processing, provides a new way to solve these problems. Among the many open-source cloud computing projects, Hadoop, an open-source project of the Apache Software Foundation, is the most widely used [4]. Building the search engine on Hadoop can fully exploit the complementary strengths of the two and make up for the shortcomings of the traditional search engine.

2. Hadoop

Hadoop is an open-source distributed parallel computing platform that offers reliability, efficiency, and scalability. It mainly consists of the parallel computing framework MapReduce [5] and the distributed file system HDFS [6], which give Hadoop its efficient parallel computing ability and its mass data storage capacity. Hadoop is designed on the assumption that system failure is normal: it keeps the cloud computing platform running reliably by maintaining multiple replicas of the data and re-assigning work to new nodes as fast as possible when nodes fail. Hadoop uses a master-slave structure, with a single master server (JobTracker) and a number of slave servers (TaskTrackers) in a cluster. The JobTracker is the interface between users and the framework. When a user submits a job, the JobTracker puts it into the job queue and executes jobs on a first-come, first-served basis. The JobTracker maintains the Map and Reduce tasks assigned to the TaskTrackers. Each TaskTracker executes the instructions it receives from the JobTracker and handles the exchange of data between the Map and Reduce phases. Each node periodically reports its completed work and updated status to the JobTracker. If a TaskTracker does not communicate with the JobTracker for longer than a configurable timeout, the JobTracker marks that node as dead and re-assigns its work to other nodes.

2.1. MapReduce Computing Framework

MapReduce consists of two phases, Map and Reduce, which perform the mapping and reducing operations respectively. Each mapping operation is independent of the others, so the Map phase has a high degree of parallelism. The reducing operation receives the results of the mapping operations and merges them, and the Reduce phase is highly parallel as well. It is this highly parallel distributed computation that lets mass data be processed efficiently on the cloud computing platform. The MapReduce functions are as follows [7]:

Map: (in_key, in_value) → {(key_j, value_j) | j = 1, ..., k}
Reduce: (key, [value_1, ..., value_m]) → (key, final_value)

The input of Map is the pair (in_key, in_value), and its output is a set of <key, value> pairs. The input of Reduce is (key, [value_1, ..., value_m]); after receiving these parameters, Reduce merges the values obtained from Map and outputs (key, final_value). The MapReduce operation model is shown in figure 1.

Fig 1. MapReduce Operation Model

The MapReduce execution flow is as follows (a minimal word-count sketch of steps 4 and 5 follows this list):
1: Input(File A);
2: Split(A); // separate file A into m data blocks sized between 16 MB and 64 MB
3: The master program allocates m Mapper and r Reducer machines;
4: Mapper_1, ..., Mapper_m execute Map(in_key, in_value) in parallel, and the results are stored in the local cache;
5: Reducer_1, ..., Reducer_r fetch the results from the Mappers by remote call and execute Reduce(key, [value_1, ..., value_m]);
6: Output(Final_file B);
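To make the flow above concrete, here is a minimal word-count job written against the standard org.apache.hadoop.mapreduce API. It is an illustrative sketch of the general technique, not code from the paper; the class and method names are our own.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: (offset, line) -> {(word, 1)}, i.e. (in_key, in_value) -> {(key_j, value_j)}
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce: (word, [1, 1, ...]) -> (word, final_count), i.e. (key, [values]) -> (key, final_value)
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // Input(File A)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // Output(Final_file B)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework itself performs the splitting, scheduling, and the remote fetch between Mappers and Reducers described in steps 2-5 above.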

2.2. HDFS (Hadoop Distributed File System)

HDFS is designed in master-slave mode and manages all the data of the cloud computing platform. It consists of two kinds of nodes: a single NameNode and a large number of DataNodes. The NameNode provides the metadata service of HDFS, while the DataNodes provide the storage blocks. The HDFS architecture is shown in figure 2.

Fig 2. HDFS Architecture

When a user program accesses HDFS, it first visits the NameNode to obtain the metadata, and then accesses the data directly on the DataNodes. This design separates the control flow from the data flow: between the user program and the NameNode there is only control flow and no data flow, which greatly reduces the load on the NameNode and keeps it from becoming a bottleneck in system performance. Data flows directly between the DataNodes and the user program; and because a file is divided into data blocks that are stored, and backed up, across nodes, a user program can access several DataNodes at the same time. This makes the I/O of the whole system highly parallel and improves overall performance. A sketch of this read path through the client API follows.
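The two-step access pattern just described (metadata from the NameNode, then block data from the DataNodes) is exactly what the standard org.apache.hadoop.fs.FileSystem client API performs under the hood. The following sketch is illustrative only, and the file path in it is hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // fs.defaultFS in core-site.xml points at the NameNode
    FileSystem fs = FileSystem.get(conf);

    // open() asks the NameNode for the file's block locations (control flow only) ...
    Path file = new Path("/webdb/seeds.txt"); // hypothetical path
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      // ... and read() then streams the blocks directly from the DataNodes (data flow).
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}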

3. Overview of Search Engine

The search engine is the most effective tool for discovering usable information on the World Wide Web, and it has become a necessity for exploring the internet. Without search engines, the information in websites, blogs, and so on would be of little use, because it is almost impossible to find information by visiting websites one by one. A search engine is a system that, based on certain strategies, uses specific computer programs to collect information on the internet, organizes that information, and provides retrieval services for users. A search engine generally consists of five parts: a fetcher (information collection device), a parser, an indexer, a retriever, and a user interface. The system structure is shown in figure 3.

Fig 3. Search Engine Architecture

The Fetcher, also known as the web crawler, finds and collects information from the internet. The Parser analyzes the collected documents and passes them to the Indexer. The Indexer transforms each document into a form that is easy to retrieve and stores it in the index database. Using the indexes created by the Indexer and the keywords input by users, the Retriever finds the documents that match the keywords and sorts the results. The user interface lets users find information conveniently.

The search engine workflow is as follows (a breadth-first crawl sketch for step 2 follows this list):
1) Prepare links. Seed links are added to an XML or text file and submitted to WebDB, a local folder. The link preparation module reads one of the URLs and hands it to the Fetcher.
2) Crawl pages. After receiving a URL, the Fetcher crawls pages with a breadth-first search strategy and stores all the pages in local files.
3) Parse pages. After receiving the crawled pages, the Parser analyzes them and extracts the page text and feature information such as title, time, and source. The Parser has two important tasks after parsing a page. One is to store the URL lists extracted from the page in the local folder Segment and to generate a new link list for the Fetcher to crawl; the Parser integrates and simplifies the new link lists and saves them locally so that new crawling tasks can be assigned easily. The other is to submit the page text to the Indexer.
4) Repeat steps 2-3 until the crawling depth set by the user is reached.
5) Create the index. After the steps above are finished, the Indexer builds an index and stores it locally.
6) Search. When the user submits a query, the retrieval module searches the local index for pages related to the topic, integrates the query results, and shows them to the user ordered from the most relevant to the least.
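As a rough illustration of step 2, the following sketch crawls breadth-first from a seed URL up to a fixed depth. It is a simplified stand-in for a real fetcher (no politeness delays, robots.txt handling, error recovery, or page storage), and all names in it are our own.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BfsFetcher {
  // Crude out-link extraction; a real parser would use an HTML parser instead.
  private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

  public static void crawl(String seed, int maxDepth) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    Set<String> visited = new HashSet<>();
    Queue<String> frontier = new ArrayDeque<>();
    frontier.add(seed);
    visited.add(seed);
    // One pass over the frontier per depth level: classic breadth-first order.
    for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
      Queue<String> next = new ArrayDeque<>();
      for (String url : frontier) {
        HttpResponse<String> resp = client.send(
            HttpRequest.newBuilder(URI.create(url)).build(),
            HttpResponse.BodyHandlers.ofString());
        String page = resp.body();          // a real fetcher would store the page here
        Matcher m = LINK.matcher(page);
        while (m.find()) {                  // collect out-links for the next level
          String link = m.group(1);
          if (visited.add(link)) next.add(link);
        }
      }
      frontier = next;
    }
  }

  public static void main(String[] args) throws Exception {
    crawl("https://example.com/", 2); // hypothetical seed URL and depth
  }
}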

As we can see from the framework above, the traditional search engine works in a centralized manner and therefore cannot achieve efficient parallel operation. As a result, the current search engine has difficulty processing the mass data in the network efficiently and providing users with timely search services. These are the issues search engines are facing, and the ones this paper sets out to solve.

4. Search Engine Model Based on Cloud Computing Platform

From the analysis of Hadoop's distributed computing framework and the current search engine architecture above, it can be seen that the search engine is not good at handling mass data, but cloud computing can make up for this weakness with its efficient distributed computing framework MapReduce and its distributed file system HDFS with mass storage capacity. Building the search engine on the Hadoop platform solves the problems of mass data processing and mass data storage, and greatly improves the search engine's real-time search and response speed. The search engine model based on the cloud computing platform is shown in figure 4.

Fig 4. Search Engine Model Based on Cloud Computing Platform

As figure 4 shows, the bottom layer of the search engine is a cloud computing platform based on Hadoop. This model improves on the traditional search engine in two ways. First, the computing manner of the fetcher, parser, indexer, and retriever changes substantially: in this model they run on the MapReduce framework. Second, the index database is replaced by HDFS, and the index is managed and maintained by the distributed file system in master-slave mode. The workflow of this model is as follows (a sketch of the indexing job in step 5 closes this section).

1) Prepare links. Seed links are added to an XML or text file and submitted to WebDB, now a folder in HDFS. The link preparation module reads the URLs and splits them into link blocks.
2) Crawl pages. After receiving the link blocks, the JobTracker mentioned above starts MapReduce tasks and assigns page-crawling tasks to the TaskTrackers. Each TaskTracker (Fetcher) that receives a Map task crawls pages with the breadth-first search strategy; the TaskTracker (Fetcher) that receives the Reduce task integrates and filters the pages crawled by the Map tasks and stores all the pages in HDFS.
3) Parse pages. When the Parser receives the crawled pages, MapReduce tasks begin. Each TaskTracker (Parser) that receives a Map task analyzes pages and extracts the page content and feature information such as title, time, and source. The Parser again has two tasks after parsing: it stores the URL lists extracted from the pages in Segment, a folder in HDFS, and generates a new link list for the Fetcher to crawl; and it submits the page content to the Indexer. The TaskTracker (Parser) that receives the Reduce task integrates and simplifies the new link lists and submits them to HDFS, where the NameNode manages them centrally so that new crawling tasks can be assigned easily.
4) Repeat steps 2-3 until the crawling depth set by the user is reached.
5) Create the index. The Indexer starts a MapReduce job after the pages have been crawled. Each TaskTracker (Indexer) that receives a Map task builds an index and stores it locally. The TaskTracker (Indexer) that receives the Reduce task submits the indexes stored on all the Map-side TaskTrackers (Indexers) to the NameNode for unified management, which makes user queries easier.
6) Search. When the user submits a query, the retrieval module starts MapReduce tasks. Each TaskTracker (Retriever) that receives a Map task searches its local index for pages related to the topic; the TaskTracker (Retriever) that receives the Reduce task integrates the query results and returns them to the user ordered by relevance to the topic, from high to low.

In addition, WebDB and Segment are data structures used in general search engines and are not described again here.
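As an illustration of the Map and Reduce roles in step 5, here is a minimal inverted-index job in the same Hadoop API as the earlier word-count sketch: the mapper emits (term, documentId) pairs and the reducer merges them into a postings list. It is our own sketch of the general technique, not the paper's indexer, and it assumes each input line has the hypothetical form "docId<TAB>page text".

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {
  // Map: "docId<TAB>text" -> {(term, docId)}
  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t", 2);
      if (parts.length < 2) return;
      Text docId = new Text(parts[0]);
      for (String term : parts[1].toLowerCase().split("\\W+")) {
        if (!term.isEmpty()) ctx.write(new Text(term), docId);
      }
    }
  }

  // Reduce: (term, [docId, docId, ...]) -> (term, "doc1,doc2,...") — the postings list
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docIds, Context ctx)
        throws IOException, InterruptedException {
      Set<String> postings = new LinkedHashSet<>(); // de-duplicate, keep first-seen order
      for (Text id : docIds) postings.add(id.toString());
      ctx.write(term, new Text(String.join(",", postings)));
    }
  }
}

In the model above, the Reduce output would be written to HDFS rather than local disk, so the NameNode can manage the merged index centrally.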

5. Keywords Weighting Improvement

The search engine model based on the cloud computing platform is described above; to get the best performance, however, the keyword weighting needs to be optimized. TF-IDF is the most widely used method of keyword weighting. It considers both the importance of a keyword within a single document and its importance across the entire data set, which makes the weight of a keyword more reasonable when it appears often in a single document but not throughout the data set. For example, the frequency of "biological" is high in data sets related to biology, but this word matters less to a document than "DNA" or "cell". Selecting keywords well is significant for excluding nonessential information and reducing the indexing time. The inverse document frequency (IDF) and term frequency (TF) of a keyword k are calculated as (1) and (2) [8]:

idf_k = log(N / n_k)   (1)
tf_{k,j} = freq_{k,j} / len_j   (2)

where N is the number of data files (texts), n_k is the number of documents that contain keyword k, freq_{k,j} is the number of times keyword k appears in document j, and len_j is the number of words in document j. Combining TF and IDF, the weight of keyword k in document j is calculated as (3):

w_{k,j} = tf_{k,j} × idf_k   (3)

As we can see from (1) and (2), only a statistical value, term frequency, is considered in calculating the keyword weight. But there is also much non-statistical evidence, for example keywords in the title, or bold or italic keywords. Compared with ordinary page content, these elements represent the characteristics of a page better, and they are very easy to analyze: the relevant parameters are obtained simply by distinguishing the markup in the HTML. The improved TF-IDF is (4):

w_{k,j} = λ × tf_{k,j} × idf_k   (4)

where λ is a page-structure parameter. Generally, when the keyword appears in the title, λ takes its maximum value of 10; when the keyword is bold or italic in the document, λ takes 5; when there is no tag, λ takes 1.
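A small sketch of the improved weighting follows, under the reconstruction above; note that the parameter symbol λ and the no-tag value of 1 are our reading of a garbled passage in the original.

public class KeywordWeight {
  // Inverse document frequency (1): idf_k = log(N / n_k).
  static double idf(long totalDocs, long docsWithKeyword) {
    return Math.log((double) totalDocs / docsWithKeyword);
  }

  // Term frequency (2): occurrences of the keyword / total words in the document.
  static double tf(long occurrences, long wordsInDoc) {
    return (double) occurrences / wordsInDoc;
  }

  // Structural boost: 10 for title keywords, 5 for bold/italic, 1 otherwise (assumed baseline).
  static double lambda(boolean inTitle, boolean boldOrItalic) {
    if (inTitle) return 10.0;
    if (boldOrItalic) return 5.0;
    return 1.0;
  }

  // Improved weight (4): w = lambda * tf * idf.
  static double weight(long occurrences, long wordsInDoc,
                       long totalDocs, long docsWithKeyword,
                       boolean inTitle, boolean boldOrItalic) {
    return lambda(inTitle, boldOrItalic)
        * tf(occurrences, wordsInDoc)
        * idf(totalDocs, docsWithKeyword);
  }

  public static void main(String[] args) {
    // Example: keyword appears 4 times in a 200-word page and in its title; the corpus
    // has 10,000 documents, 50 of which contain the keyword (numbers made up for illustration).
    System.out.println(weight(4, 200, 10_000, 50, true, false));
  }
}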

6. Experiment and Results Analysis

Owing to the limits of the experimental environment, we used 8 PCs running Ubuntu to build the cloud computing platform and installed the search engine system on it. One computer served as the NameNode and the remaining seven as DataNodes. To test the efficiency of the search engine deployed on the cloud computing platform, we took the crawling depth as the index of data growth and captured data from our school intranet, from low depths to higher ones. Each layer was crawled three times, and each result is the average of the three runs. The experiment results are shown in table 1 and figure 5.

Table 1. Crawling Time of Different Platforms (unit: min) — per-depth times for the cloud computing platform and the centralized platform; the numeric entries were not preserved in the transcription.

Fig 5. Comparison of Different Platforms in Crawling Time

As table 1 and figure 5 show, up to depth 5 the search engine deployed on the cloud computing platform takes more time than the centralized search engine. This is because, when the amount of data is small, the cloud-based search engine spends a larger proportion of the capture time on communication between nodes, which ultimately lengthens the total crawling time; the centralized search engine crawls faster at that stage because it has no such communication and the data volume is small. Beyond depth 5, as the amount of data grows, the advantage in processing mass data gradually emerges, and the capture time becomes much less than that of the centralized search engine; this advantage grows more and more obvious as the amount of data increases. After capturing the data and searching for keywords through the client, we obtained satisfactory results among the first 20 records, returned within 0.01 ms to 0.49 ms. The experiment results show that a search engine deployed on the cloud computing platform can solve the problem of inefficient mass data processing, and the displayed results are greatly improved.

7. Conclusion

Based on a deep analysis of cloud computing and search engines, a good combination point was found: deploying the search engine on the cloud computing platform can deal with the problems search engines face, namely mass data processing and mass data storage. A search engine model based on the open-source cloud computing platform Hadoop was proposed, and two algorithms of the search engine were improved.

As the experiment shows, the results are satisfactory overall and meet the expected goals. But there are two shortcomings. On the one hand, the experiments were carried out on an intranet rather than on the internet; on the other hand, only one website with a limited data set was used, so the mass data processing capacity of cloud computing was not fully demonstrated. Further work will focus on these two aspects to achieve better results.

8. Acknowledgement

Foundation item: Project (2006AA12A106) supported by the National High Technology Research and Development Program of China (863 Program); Projects ( , ) supported by the National Natural Science Foundation of China; Project (MHRD201013) supported by the Civil Aviation Administration Science Foundation of China.

9. References

[1] Dorin Carstoiu, Elena Lepadatu, Mihai Gaspar, "Hbase - non SQL Database, Performances Evaluation", IJACT, Vol. 2, No. 5, pp. 42-52.
[2] Waralak V. Siricharoen, "Using Integrated Ontologies for Determining Objects towards Software Engineering Approach", AISS, Vol. 2, No. 4, pp. 61-70.
[3] Hochul Jeon, Taehwan Kim, Joongmin Choi, "Personalized Information Retrieval by Using Adaptive User Profiling and Collaborative Filtering", AISS, Vol. 2, No. 4, pp. 134-142.
[4] Omid Kashefi, Nina Mohseni, Behrouz Minaei, "Optimizing Document Similarity Detection in Persian Information Retrieval", JCIT, Vol. 5, No. 2, pp. 101-106.
[5] Wang Ying, Liu Guangli, Bai Shengli, Yang Zhimin, "Attribute Extraction System for Agricultural SEM", JCIT, Vol. 5, No. 3, pp. 20-23.
[6] Debajyoti Mukhopadhyay, Sukanta Sinha, "A Novel Approach for Domain Specific Lucky Web Search", JCIT, Vol. 5, No. 5, pp. 72-80.
[7] Peng Liu, "Cloud Computing", Electronic Industry Press, pp. 10-18.
[8] Chang-yuan Feng, Jie-xin Pu, "Research about Algorithm of Web Text Feature Selection", Application Research of Computers, Vol. 07, pp. 36-38, 2005.
[9] Tao Wang, Xiao-zhong Fan, "Design and Implementation of Topical Crawler", Computer Applications, Vol. 24, 2004.
[10] Li-zhu Zhou, Ling Lin, "Survey on the Research of Focused Crawling Technique", Computer Applications, Vol. 25, No. 9, 2005.
[11] Taher H. Haveliwala, "Topic-Sensitive PageRank", Proceedings of the 11th International World Wide Web Conference, Hawaii, 2002.
[12] Michael Armbrust, Armando Fox, Rean Griffith, et al., "Above the Clouds: A Berkeley View of Cloud Computing", UC Berkeley, RAD Laboratory, 2009.
[13] "Hadoop Distributed File System: Architecture and Design".
[14] Hadoop Site.
[15] "Hadoop Map/Reduce Tutorial".
[16]
