SQL Query Optimization on Cross Nodes for Distributed System
2016 International Conference on Power, Energy Engineering and Management (PEEM 2016) ISBN:

Feng ZHAO 1, Qiao SUN 1, Yan-bin JIAO 1 and Jia-song SUN 2,*
1 Beijing GuoDianTong Network Technology Co., Ltd, Beijing, China
2 E. E. Department, Tsinghua University, Beijing, China
*Corresponding author

Keywords: Query path optimization, Cross nodes, Cost of query, SQL, Distributed system.

Abstract. Queries have expanded from simple query expansion along a few dimensions to complex multi-dimensional queries. It is becoming more and more difficult to extract, store and analyze massive data with traditional database software, so a better query strategy is the key to optimizing database query operations. As an important means of analyzing data in a database, SQL queries play an important role in analyzing and processing data. Through SQL queries, users can quickly get the information they care about most. With the continuous development of the Internet industry, the scale of the data that a database has to handle keeps growing. This data is characterized by large volume, multiple types, fast processing speed and low value density. This paper proposes a new global adaptive optimization method that aims to reduce the number of data I/O operations and to balance load when SQL queries are executed in a distributed parallel system. Its characteristics are: it builds a multi-factor decision fuzzy evaluation model for the optimization decision on each sub-query path, and it defines an optimized global cost function for adaptive optimization of the global query path, so that the query satisfies its minimum total cost requirement. The experiments show that the global query method achieves a faster query speed. At the same time, the total cost of a query is controllable through the definition of the global optimized cost function.
Introduction

The distributed database system is the organic combination of a computer network and a database system. Because large amounts of data are transmitted over the network, query processing and optimization become the key factor in improving the query performance of a distributed database. Query processing and optimization use suitable algorithms to reduce the amount of information involved, so as to improve the response time of the query and reduce the cost to the system. The query optimizer is widely considered to be the most important part of a database system. The main aim of the optimizer is to take a user query and produce a detailed plan, called a Query Execution Plan (QEP), that tells the executor exactly how the query should be executed. The problem the optimizer faces is that, for a given user query, there exists a large space of different equivalent QEPs, each with its own execution cost. The plans are equivalent in the sense that they return the same result for the user query, but their costs may differ by orders of magnitude. In a centralized database system, an estimate of the number of I/Os performed is typically used as the cost metric for a plan. If the optimizer chooses a plan with a poor cost, the execution can take several days, while another plan may exist that performs the execution in seconds [1]. The retrieval of data from different sites in a network is known as distributed query processing [2]. The difference between query processing in a centralized database and in a distributed database is the potential for decomposing a query into sub-queries that can be processed in parallel, with their intermediate results sent in parallel to the computers that need them. Finding an efficient way of processing a query is important. If a query is processed inefficiently, it not only takes a long time before the end user gets an answer, but it may also degrade the performance of the whole system because of network congestion.
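To make the plan-selection idea concrete, the following minimal sketch picks the cheapest of several equivalent plans. It is an illustration only, not the paper's optimizer: the `Plan` fields, the plan names, the numbers, and the weights `io_weight` and `net_weight` are all hypothetical, and the cost is simply assumed to be a weighted sum of estimated I/O and data shipped between sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """One candidate query execution plan (all fields are hypothetical estimates)."""
    name: str
    io_pages: int        # estimated number of page reads/writes
    bytes_shipped: int   # estimated bytes sent between sites


def plan_cost(plan, io_weight=1.0, net_weight=0.001):
    """Assumed cost model: weighted sum of I/O and network transfer."""
    return io_weight * plan.io_pages + net_weight * plan.bytes_shipped


def choose_plan(plans, **weights):
    """Return the equivalent plan with the lowest estimated cost."""
    if not plans:
        raise ValueError("at least one candidate plan is required")
    return min(plans, key=lambda p: plan_cost(p, **weights))


# Three plans that would return the same result at very different cost.
candidates = [
    Plan("ship_whole_table", io_pages=1_000, bytes_shipped=50_000_000),
    Plan("filter_then_ship", io_pages=1_200, bytes_shipped=500_000),
    Plan("local_index_scan", io_pages=300, bytes_shipped=2_000_000),
]

best = choose_plan(candidates)
print(best.name)
```

With the default weights the plan that filters before shipping is selected. Setting `net_weight` to 0 reduces the metric to the I/O-only estimate typical of centralized systems, and a different plan is chosen, which is one way to see why communication cost matters in the distributed case.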
There are two distinct types of nodes that implement the query processing functionality [3]:

1) Control node. The control node manages the distribution of query execution across the compute nodes, accepts client connections to the PDW (Parallel Data Warehouse) appliance and manages client authentication. In addition to a SQL Server instance, the control node contains additional software to support the distributed architecture of the PDW. This includes the engine that coordinates the data warehousing functions specific to processing parallel queries, stores appliance-wide metadata and configuration data, and manages appliance and database authentication and authorization.

2) Compute nodes. Each compute node hosts a single SQL Server instance. It also runs a DMS (Data Movement Service) process for communication and data transfer with the other nodes in the appliance. Each compute node stores a portion of the user data.

Big data applications often need to access datasets on different platforms that may even be cross-domain. For structured data, the time cost of data extraction and loading cannot meet real-time requirements. The platforms involved interconnect via LAN or Internet, resulting in a distributed and heterogeneous network topology. In this topology, the data sources are dynamic, heterogeneous and autonomous. The work in [4] implements a cross-platform query interface with which clients can directly execute online join queries between discrete deployments of Banian, or between Banian and any other relational database (such as MySQL and Oracle). The cross-platform query interface contains three main components: the SQL interface, the cross-platform module and the global table. The SQL interface provides a command shell for users and forwards query commands to the cross-platform module. If a request command involves several datasets on different platforms, the cross-platform module queries the global table and gets the Location information (a data structure).
Then it splits the command according to the tag name variable of Location and, acting as the master, sends each sub-command to the corresponding slave platform and receives the result.

At present, the general query steps for a distributed system are as follows. The query is decomposed according to the user's query content and the query path is initialized. The local database is then checked. If the data is available locally, the query is executed locally; if not, the global query processing module uses the query path to select a node to process the query, namely the database node whose cost for querying the tables involved is lowest. A connection to the selected node is established, and the query command is transmitted to that node for execution. Because the distributed database system lives in a network environment, this process must take into account the communication cost between nodes and the distributed computation. With the current steps of query decomposition, data localization, local optimization and global optimization, the communication cost and the actual query cost still fail to satisfy users' requirements, and a globally optimal execution node is not obtained in distributed systems.

This paper investigates a new optimization method that aims to reduce the number of data I/O operations and to balance load when SQL queries are executed in a distributed parallel system. Its characteristics are: it builds a multi-factor decision fuzzy evaluation model for the optimization decision on each sub-query path, and it defines an optimized global cost function for adaptive optimization of the global query path, so that the query satisfies its minimum total cost requirement. The experiments show that the global query method achieves a faster query speed. At the same time, the total cost of a query is controllable through the definition of the global optimized cost function. The rest of this paper is organized as follows.
Section 2 gives a general introduction to queries in distributed systems. Section 3 presents the new optimization method, which can improve the query speed and the total cost of a query. Section 4 demonstrates the performance of this approach. Section 5 is the conclusion.

Query for the Distributed System

Big data is currently a research focus in both academia and industry. To analyze massive amounts of data and obtain valuable information and knowledge, researchers have developed many excellent
systems and technologies [5, 6, 7, 8]. Query processing and optimization use suitable algorithms to reduce the amount of information involved, so as to improve the response time of the query and reduce the system cost. Compared with traditional single-machine optimization, distributed query optimization offers better data reliability, faster query speed and scalable storage capacity. Distributed query optimization generally includes query decomposition, data localization, global optimization and local optimization. In detail:

1) Query decomposition. Query decomposition translates the query problem (such as an SQL statement) into a relational algebra expression defined on the global relations.

2) Data localization. Data localization transforms the query on the global relations into a query on the appropriate fragments, localizing the query or coming as close to localization as possible.

3) Global optimization. The input to global optimization is the fragment (slice) query, that is, the query expressed over fragments. The goal of query optimization is to find a near-optimal execution strategy. Global optimization finds the best order of operations for the fragment query, minimizing a cost function. The output of the global optimization layer is an optimized relational algebra query over the fragments.

4) Local optimization. Local query optimization is carried out at all the sites that hold fragments related to the query. The sub-queries executed at each site are called local queries. Each is optimized by the DBMS at its site, using the optimization algorithms of centralized database systems.

Figure 1. Steps of overall merging implementation.

Table 1. Query Latency of GA and TF under 10 SQL commands.
[Table 1 columns: Commands | Query Latency (s): GA, TF; row values not recoverable]

New Global Adaptive Optimization Approach

Queries have been extended from simple combinatorial queries along a few dimensions to complex queries. Extracting, storing and analyzing massive data with traditional database software makes it more and more difficult to get results in real time, so choosing a better query strategy has become the focus of database optimization for complex query operations. This paper proposes a new adaptive global optimization approach that reduces the frequency of data I/O and balances load in a distributed parallel SQL query system. Its characteristics are: it builds a multi-factor decision fuzzy evaluation model for the optimization decision on each sub-query path, and it defines a global optimization cost function for adaptive optimization of the global query path, so that the query meets its minimum total cost requirement. The detailed implementation steps (shown in Fig. 1) are:
1) Computing the global query total cost. Determine the minimum requirement on the total cost of the global query, where the total cost is defined as the weighted sum of the error and the response time.

2) Local optimization phase. Construct a multi-factor decision fuzzy evaluation model and, through query decomposition and data localization, apply multi-factor decision fuzzy evaluation to the optimization decision for each sub-query path. The evaluation results serve as the input to local optimization.

3) Global optimization phase. The objective of global optimization is to find the best operation order for the fragment (slice) query and to minimize the cost function. With the global optimization cost function defined, an adaptive method based on a BP neural network performs overall adaptive optimization of all query paths, so that the global total query cost meets the minimum requirement set in step 1.

The advantage of this method is that the global query is faster because the query path is optimized, and the total cost of the query is controllable through the definition of the global optimization cost function.

Table 2. Query Latency of GA and TF under 20 SQL commands.
[Table 2 columns: Commands | Query Latency (s): GA, TF; row values not recoverable]

Experiments and Performance

In this section, we evaluate the performance and scalability of our global adaptive optimization approach (GA) and compare the results with those of the traditional four steps (TF) of querying in a distributed system. First, on a cluster of 10 nodes, we ran ten SQL commands on a 10 GB database to evaluate the query latency (see Table 1); then we ran twenty SQL commands on a 100 GB database to compare and analyze the query results (see Table 2).
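A latency comparison of this kind could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not the authors' test harness: `mean_latency`, `speedup` and the injectable `clock` parameter are hypothetical names, `run_query` stands in for whatever callable submits one SQL command, and "times faster" is assumed to mean the ratio of TF mean latency to GA mean latency.

```python
import time
from statistics import mean


def mean_latency(run_query, repetitions=50, clock=time.perf_counter):
    """Run `run_query` repeatedly and return the mean wall-clock latency in seconds.

    `clock` is injectable so the function can be checked deterministically.
    """
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")
    samples = []
    for _ in range(repetitions):
        start = clock()
        run_query()
        samples.append(clock() - start)
    return mean(samples)


def speedup(tf_latency, ga_latency):
    """Assumed definition of 'N times faster': TF latency divided by GA latency."""
    if ga_latency <= 0:
        raise ValueError("GA latency must be positive")
    return tf_latency / ga_latency


# Deterministic demonstration with a fake clock: two runs taking 2 s and 4 s.
ticks = iter([0.0, 2.0, 2.0, 6.0])
demo_mean = mean_latency(lambda: None, repetitions=2, clock=lambda: next(ticks))
print(demo_mean)
print(speedup(7.0, 2.0))
```

Averaging over repeated runs smooths out caching and network jitter, and reporting a ratio rather than a difference keeps results comparable across databases of different size.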
During this evaluation, each SQL command was run 50 times and the average value is reported. For the 10 GB queries, GA is on average 2.34 times faster than TF in query latency, and 3.5 times faster for command_6 in particular. For the 100 GB queries, GA is on average 2.38 times faster than TF, and 4.26 times faster for command_1 in particular. From the experiments we can conclude that the global adaptive optimization approach has a significant advantage over the traditional mechanism.

Summary

Query processing and optimization are the key factor in improving the query performance of a distributed database. This paper proposed a new adaptive global optimization approach that reduces the frequency of data I/O and balances load in a distributed parallel SQL query system. Its merit is that it optimizes the query path of each sub-fragment (slice) and reduces the computational burden of global
optimization. Another advantage is that adaptive optimization of the query is realized through neural-network-based adaptive optimization of the global query path.

Acknowledgement

This research was financially supported by the Science and Technology Project of the State Grid Corporation of China (SGZJ0000BGJS ) and the State Grid Information & Telecommunication Group Co., Ltd. (SGITG-KJ-JSKF[2015]0003).

References

[1] Robert Taylor. Query Optimization for Distributed Database Systems. Master thesis of University of Oxford, August
[2] P.M.G. Apers, A.R. Hevner, S.B. Yao. Optimization Algorithms for Distributed Queries. IEEE Transactions on Software Engineering, Vol 9: 1,
[3]
[4] Tao Xu, Dongsheng Wang, and Guodong Liu. Banian: A Cross-Platform Interactive Query System for Structured Big Data. Tsinghua Science and Technology, ISSN, 07/11, pp 62-71, Volume 20, Number 1, February
[5] S. Ghemawat, H. Gobioff, and S.T. Leung, The Google file system, ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp ,
[6] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Commun. of ACM, vol. 51, no. 1, pp ,
[7] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, The Hadoop distributed file system, in Proceedings of IEEE Conference on Mass Storage Systems and Technologies (MSST), 2010, pp
[8] D. Borthakur, J. Gray, J.S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, et al., Apache Hadoop goes realtime at Facebook, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2011, pp
Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,
More informationHADOOP FRAMEWORK FOR BIG DATA
HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further
More informationA Security Audit Module for HBase
2016 Joint International Conference on Artificial Intelligence and Computer Engineering (AICE 2016) and International Conference on Network and Communication Security (NCS 2016) ISBN: 978-1-60595-362-5
More informationChapter 5. The MapReduce Programming Model and Implementation
Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing
More informationSurvey on Incremental MapReduce for Data Mining
Survey on Incremental MapReduce for Data Mining Trupti M. Shinde 1, Prof.S.V.Chobe 2 1 Research Scholar, Computer Engineering Dept., Dr. D. Y. Patil Institute of Engineering &Technology, 2 Associate Professor,
More informationOn The Fly Mapreduce Aggregation for Big Data Processing In Hadoop Environment
ISSN (e): 2250 3005 Volume, 07 Issue, 07 July 2017 International Journal of Computational Engineering Research (IJCER) On The Fly Mapreduce Aggregation for Big Data Processing In Hadoop Environment Ms.
More informationSearching frequent itemsets by clustering data: towards a parallel approach using MapReduce
Searching frequent itemsets by clustering data: towards a parallel approach using MapReduce Maria Malek and Hubert Kadima EISTI-LARIS laboratory, Ave du Parc, 95011 Cergy-Pontoise, FRANCE {maria.malek,hubert.kadima}@eisti.fr
More informationClassification and Optimization using RF and Genetic Algorithm
International Journal of Management, IT & Engineering Vol. 8 Issue 4, April 2018, ISSN: 2249-0558 Impact Factor: 7.119 Journal Homepage: Double-Blind Peer Reviewed Refereed Open Access International Journal
More informationDynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce
Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Shiori KURAZUMI, Tomoaki TSUMURA, Shoichi SAITO and Hiroshi MATSUO Nagoya Institute of Technology Gokiso, Showa, Nagoya, Aichi,
More informationIntroduction to MapReduce Algorithms and Analysis
Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)
More informationResearch and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b, Lei Xinhua1, c, Li Xiaoming3, d
4th International Conference on Machinery, Materials and Computing Technology (ICMMCT 2016) Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b,
More informationCLIENT DATA NODE NAME NODE
Volume 6, Issue 12, December 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Efficiency
More informationLarge-Scale Web Traffic Log Analyzer using Cloudera Impala on Hadoop Distributed File System
Large-Scale Web Traffic Log Analyzer using Cloudera Impala on Hadoop Distributed File System Choopan Rattanapoka * and Prasertsak Tiawongsombat Abstract Resource planning and data analysis are important
More informationImplementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b
International Conference on Artificial Intelligence and Engineering Applications (AIEA 2016) Implementation of Parallel CASINO Algorithm Based on MapReduce Li Zhang a, Yijie Shi b State key laboratory
More informationPerformance Models of Access Latency in Cloud Storage Systems
Performance Models of Access Latency in Cloud Storage Systems Qiqi Shuai Email: qqshuai@eee.hku.hk Victor O.K. Li, Fellow, IEEE Email: vli@eee.hku.hk Yixuan Zhu Email: yxzhu@eee.hku.hk Abstract Access
More information1. Introduction to MapReduce
Processing of massive data: MapReduce 1. Introduction to MapReduce 1 Origins: the Problem Google faced the problem of analyzing huge sets of data (order of petabytes) E.g. pagerank, web access logs, etc.
More informationResearch on Load Balancing in Task Allocation Process in Heterogeneous Hadoop Cluster
2017 2 nd International Conference on Artificial Intelligence and Engineering Applications (AIEA 2017) ISBN: 978-1-60595-485-1 Research on Load Balancing in Task Allocation Process in Heterogeneous Hadoop
More informationPer-Packet Load Balancing in Data Center Networks
Per-Packet Load Balancing in Data Center Networks Yagiz Kaymak and Roberto Rojas-Cessa Abstract In this paper, we evaluate the performance of perpacket load in data center networks (DCNs). Throughput and
More informationAn Efficient Provable Data Possession Scheme based on Counting Bloom Filter for Dynamic Data in the Cloud Storage
, pp. 9-16 http://dx.doi.org/10.14257/ijmue.2016.11.4.02 An Efficient Provable Data Possession Scheme based on Counting Bloom Filter for Dynamic Data in the Cloud Storage Eunmi Jung 1 and Junho Jeong 2
More informationBigdata Platform Design and Implementation Model
Indian Journal of Science and Technology, Vol 8(18), DOI: 10.17485/ijst/2015/v8i18/75864, August 2015 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Bigdata Platform Design and Implementation Model
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationAn Algorithm of Association Rule Based on Cloud Computing
Send Orders for Reprints to reprints@benthamscience.ae 1748 The Open Automation and Control Systems Journal, 2014, 6, 1748-1753 An Algorithm of Association Rule Based on Cloud Computing Open Access Fei
More informationA Comparative study of Clustering Algorithms using MapReduce in Hadoop
A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering
More informationParallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem
I J C T A, 9(41) 2016, pp. 1235-1239 International Science Press Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem Hema Dubey *, Nilay Khare *, Alind Khare **
More informationDesign Considerations on Implementing an Indoor Moving Objects Management System
, pp.60-64 http://dx.doi.org/10.14257/astl.2014.45.12 Design Considerations on Implementing an s Management System Qian Wang, Qianyuan Li, Na Wang, Peiquan Jin School of Computer Science and Technology,
More informationParallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce
Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over
More informationBig Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data. Fall 2012
Big Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data Fall 2012 Data Warehousing and OLAP Introduction Decision Support Technology On Line Analytical Processing Star Schema
More informationApache Spark Graph Performance with Memory1. February Page 1 of 13
Apache Spark Graph Performance with Memory1 February 2017 Page 1 of 13 Abstract Apache Spark is a powerful open source distributed computing platform focused on high speed, large scale data processing
More informationAn Improved Performance Evaluation on Large-Scale Data using MapReduce Technique
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,
More informationGlobal Journal of Engineering Science and Research Management
A FUNDAMENTAL CONCEPT OF MAPREDUCE WITH MASSIVE FILES DATASET IN BIG DATA USING HADOOP PSEUDO-DISTRIBUTION MODE K. Srikanth*, P. Venkateswarlu, Ashok Suragala * Department of Information Technology, JNTUK-UCEV
More informationA Balancing Algorithm in Wireless Sensor Network Based on the Assistance of Approaching Nodes
Sensors & Transducers 2013 by IFSA http://www.sensorsportal.com A Balancing Algorithm in Wireless Sensor Network Based on the Assistance of Approaching Nodes 1,* Chengpei Tang, 1 Jiao Yin, 1 Yu Dong 1
More informationIN organizations, most of their computers are
Provisioning Hadoop Virtual Cluster in Opportunistic Cluster Arindam Choudhury, Elisa Heymann, Miquel Angel Senar 1 Abstract Traditional opportunistic cluster is designed for running compute-intensive
More informationSDS: A Scalable Data Services System in Data Grid
SDS: A Scalable Data s System in Data Grid Xiaoning Peng School of Information Science & Engineering, Central South University Changsha 410083, China Department of Computer Science and Technology, Huaihua
More informationLogging Reservoir Evaluation Based on Spark. Meng-xin SONG*, Hong-ping MIAO and Yao SUN
2017 2nd International Conference on Wireless Communication and Network Engineering (WCNE 2017) ISBN: 978-1-60595-531-5 Logging Reservoir Evaluation Based on Spark Meng-xin SONG*, Hong-ping MIAO and Yao
More informationAn Approximately Duplicate Records Detection Method for Electric Power Big Data Based on Spark and IPOP-Simhash
Journal of Information Hiding and Multimedia Signal Processing c 2018 ISSN 2073-4212 Ubiquitous International Volume 9, Number 2, March 2018 An Approximately Duplicate Records Detection Method for Electric
More informationParallel Implementation of Fuzzy Clustering Algorithm Based on MapReduce Computing Model of Hadoop A Detailed Survey
Parallel Implementation of Fuzzy Clustering Algorithm Based on MapReduce Computing Model of Hadoop A Detailed Survey Jerril Mathson Mathew M.Tech Student College of Engineering Kidangoor Kerala, India
More informationABSTRACT I. INTRODUCTION
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISS: 2456-3307 Hadoop Periodic Jobs Using Data Blocks to Achieve
More informationJumbo: Beyond MapReduce for Workload Balancing
Jumbo: Beyond Reduce for Workload Balancing Sven Groot Supervised by Masaru Kitsuregawa Institute of Industrial Science, The University of Tokyo 4-6-1 Komaba Meguro-ku, Tokyo 153-8505, Japan sgroot@tkl.iis.u-tokyo.ac.jp
More informationReal-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b
4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2015) Real-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b 1
More information