Multi-indexed Graph Based Knowledge Storage System


Hongming Zhu 1,2, Danny Morton 2, Wenjun Zhou 3, Qin Liu 1, and You Zhou 1

1 School of Software Engineering, Tongji University, China
{zhu_hongming,qin.liu}@tongji.edu.cn, zyrusher@gmail.com
2 Well-Being and Social Sciences, Bolton University, UK
dm3@bolton.ac.uk
3 School of Computer Science, Tongji University, China
zhouwenjun77@hotmail.com

Abstract. With the rapid development of information technologies, knowledge management systems face the problem of how to manage massive volumes of data and use that data efficiently. In this paper, we analyze the current challenges of knowledge management systems and propose a multi-indexed graph based knowledge storage system that avoids data duplication and is optimized for parallel processing.

Keywords: Index, Storage System, Knowledge Management.

1 Introduction

With the development of information technologies, access to datasets from many different sources has become an important part of any knowledge management process. It is still very difficult, however, to extract the exact meaning, or related attributes, of the knowledge within a dataset. This paper explores a framework for the management, coding, and analysis of large heterogeneous datasets.

In any knowledge management system, datasets usually come from different sources, in different formats, and with varying degrees of structure. Some datasets are well structured, but there is also a lot of unstructured data arising from sources such as wiki data, web data and social network data. Because multiple semantic meanings can be applied to the same dataset, different applications may interpret or use the dataset in different ways. The objective of the research described in this paper was to design a data storage framework that helps overcome these problems in knowledge management systems.

Z. Huang et al. (Eds.): WISE 2013 Workshops 2013, LNCS 8182, pp , Springer-Verlag Berlin Heidelberg 2014

2 Motivation

2.1 The Knowledge Management System Should Be Scalable and Optimized for Massive Dataset Analysis

With the explosion of data in knowledge management systems, data analysis has to handle large volumes of data. How to avoid data duplication across different knowledge analysis projects, and how to optimize the data for parallel processing, have become the key challenges in modern knowledge management systems. A unified, scalable and parallel storage architecture is needed.

2.2 Structured, Semi-structured and Unstructured Data Can Be Managed in the Same Way

In most knowledge management systems, the datasets come from individuals or organizations. Some of these datasets are well structured according to the requirements of the originator, but that structure is not necessarily suitable for analysis and processing driven by a different requirement. Other datasets are semi-structured or even unstructured, such as web data and social network datasets. The key is to design a storage and access system that works in the same way for all types of dataset, irrespective of their source and format.

2.3 Semantic Description Can Be Easily Added in the Dataset

In any knowledge management system, it is very difficult to define a proper schema for a dataset and never change it. Different applications may focus on different schemas for the same dataset, and we may need to add more semantic meanings to the dataset during analysis. For example, a customer referral system may treat an address as a single sentence, whereas a logistics application may need to divide the address into country, state, street and so on in order to quantify the data. Within the storage system, one dataset may therefore have several schemas, depending on the application.

2.4 Semantic Meanings of Knowledge Can Be Easily Added in the Graph Based Knowledge Management System

In a knowledge management system, we often have different definitions for the knowledge. Knowledge is usually defined by certain properties, and there may also be linkages between different parts of the knowledge dataset. To describe the different semantic meanings of knowledge, we regard a particular kind of knowledge as a knowledge node type, and an atomic piece of knowledge as an instance of that node type. For example, technical publications in a particular field can be regarded as a knowledge node type; a single publication in this field is an instance of the node type. An edge is defined as a linkage between knowledge nodes, for example papers written by the same author. In this way we can build a heterogeneous graph for knowledge management problems. We introduce the idea of a knowledge schema to describe the heterogeneous graph in the knowledge management system.

2.5 The Storage System Can Be Scalable and Have Good Performance in Knowledge Processing

With an increasing number of datasets within a knowledge management system, we have to consider scalability, performance, and how the data is processed.

Distributed storage and processing is a potential solution to this problem. To get better performance, we need to add indexes to the distributed storage and processing system. The index needs to scale easily: we can add and update an index based on the graph schema, the dataset schema, and the dataset. The index also needs to be processed in parallel for better performance. We introduce the ideas of indexing and of delayed loading of a dataset into the knowledge management system.

3 Design Overview

The distributed storage system we propose is based on the Google distributed file system GFS [6]. It contains four parts: data nodes, which contain the dataset; data schema nodes and graph knowledge schema nodes, which contain the dataset schema and the graph schema; index nodes, which contain the index for the dataset; and graph knowledge nodes, which contain the knowledge graph. Figure 1 shows the architecture of the proposed multi-indexed knowledge management storage system.

[Figure 1: diagram not reproduced. Components labelled: Client, NameNode, Graph Nodes, SchemaNodes, IndexNodes, DataNodes (Rack 1, Rack 2). Operations labelled: read, write, metadata ops, graph ops, index ops, block ops, replication.]

Fig. 1. The Logical Architecture of Multi-Indexed Storage System in Knowledge Management

The dataset is stored as raw data in a distributed file system based on Hadoop. Hadoop [7] is an open source software framework that supports data-intensive distributed applications, is designed for big data processing, and supports running applications on large clusters of commodity hardware. We add index features to Hadoop to build the knowledge management system. We regard every dataset, whether structured or unstructured, as files, which are stored in Hadoop as raw files. During processing, we also try to delay the real loading of the dataset for better performance.

The data schema is the description of the dataset together with the metadata of the data index. One dataset may have more than one data schema. Each schema contains the self-described dataset structure and the metadata of the index. Since each row of data in the dataset has its own semantic meaning, the self-described dataset structure gives the position-sensitive or separator-sensitive meaning of the row data. The metadata of the index includes the availability of the index for each semantic section of the dataset, and the name, replicas and position of the index block. With the metadata of the index, we can easily find where the index block is. A data schema can be added when we import or first load the dataset, but more schemas can also be added afterwards.

The data index gives the metadata of the data blocks. With the index, we can locate the exact position of the data blocks, so we start only the minimal number of jobs needed to read the dataset, which reduces the total number of jobs running in the distributed storage and processing system. The data index can be built and updated dynamically: based on the availability of resources and on query statistics, we can choose how many indexes we need and update the index during the off-peak time of the system.

The knowledge graph schema is very similar to the data schema. It is the description of the heterogeneous knowledge graph together with the metadata of the knowledge index. The information in the knowledge graph has three parts: node type, node and edge. Each category of knowledge is described as a node type; an instance of knowledge is regarded as a knowledge node; and an edge is the linkage between two knowledge instances. The metadata of the knowledge index indicates whether the knowledge is indexed or not.

The knowledge graph node stores the knowledge graph. We regard knowledge as a heterogeneous graph; the knowledge graph node contains the atomic knowledge items and the linkages between them. To avoid duplicating the dataset, a knowledge node contains only one or several pointers, which point to the dataset.

4 Details of Design

4.1 Name Node

The name node in our multi-indexed distributed knowledge management storage system is nearly the same as the GFS name node. In GFS, the name node contains the metadata of the data nodes; in our system, the index node contains the block metadata and a hash table that relates logical block metadata to physical block metadata. This means that in the event of hardware failure we do not need to update the index node.
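
The indirection between logical and physical block metadata can be illustrated with a minimal Python sketch. It shows one possible reading of Section 4.1: index entries refer only to logical block ids, and a separate hash table resolves them to physical locations, so a replica moving after a hardware failure changes the table but not the index entries. The class `BlockMap`, its methods, the block ids and the host names are all hypothetical; the paper does not specify where the table lives or how it is updated.

```python
from dataclasses import dataclass, field


@dataclass
class BlockMap:
    """Hypothetical hash table: logical block id -> physical replica hosts."""

    physical: dict[str, list[str]] = field(default_factory=dict)

    def register(self, logical_id: str, replicas: list[str]) -> None:
        """Record where the replicas of a logical block currently live."""
        self.physical[logical_id] = list(replicas)

    def resolve(self, logical_id: str) -> list[str]:
        """Return the current physical locations of a logical block."""
        return list(self.physical[logical_id])

    def handle_node_failure(self, failed: str, replacement: str) -> None:
        """Point every replica on the failed host at its replacement.

        Only this table changes; anything that stores logical ids is untouched.
        """
        for logical_id in list(self.physical):
            self.physical[logical_id] = [
                replacement if host == failed else host
                for host in self.physical[logical_id]
            ]


# Assumed index layout: a key maps to logical block ids only.
index = {"account_id=42": ["blk-1", "blk-7"]}

block_map = BlockMap()
block_map.register("blk-1", ["datanode-a", "datanode-b"])
block_map.register("blk-7", ["datanode-b", "datanode-c"])

index_before = {key: list(blocks) for key, blocks in index.items()}
block_map.handle_node_failure("datanode-b", "datanode-d")
```

After the simulated failure, `index` is identical to `index_before`, while `block_map.resolve("blk-1")` reflects the replacement host.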

4.2 Data Schema Node and Data Index Node

The data schema contains the semantic description of the dataset. One dataset can have more than one schema. The schema shows which properties are in the dataset, the metadata of those properties, whether an index is available for each property, and the metadata of the index for each property. A typical data schema file contains six parts: dataset name, dataset description, property names, property description, property separation metadata and property index metadata. The property separation metadata supports separators, aliases, prefixes and postfixes, and the combination and iteration of the three. The property index metadata points to the data index block if an index is available; if not, it points to the dataset block.

The data index contains the index information for each property; the index is a full-text index for the property. Each key in the full-text index maps to the data blocks that contain it. The data index is stored as a B+ tree for better search performance.

4.3 Knowledge Graph Schema Node

Nowadays, knowledge is interconnected and interacts, forming numerous large, interconnected and sophisticated networks. We regard it as a heterogeneous graph. In our knowledge storage system, we also add the graph schema and graph index to the system. The knowledge graph schema contains the semantic description of the knowledge. It includes the data schema used to support the knowledge, the properties of the knowledge, the linkages between pieces of knowledge, and the linkages between knowledge properties and dataset properties. A typical knowledge schema file contains eight parts: knowledge names, knowledge description, property names, property description, property metadata, linkage name, linkage description and linkage metadata. The property metadata contains the linkage between the knowledge property and the data schema property. The linkage metadata states the condition under which two nodes are connected by an edge, and the direction of that edge.

4.4 Knowledge Graph Node

The knowledge graph node contains information about the knowledge schema, the knowledge itself and the linkage between them. To avoid duplicating the dataset, we store only the knowledge id, the node/edge type and a series of pointers, which point to the related rows of data in each related dataset. There is no data duplication in the knowledge graph system. If a query is by graph id, we can go to the dataset directly; if not, we use the knowledge graph schema to translate the query into a series of data schema queries. Since most knowledge graphs are very sparse, we use an adjacency list to store the knowledge graph.

5 Query Process Analysis

Figure 2 shows the query process in the multi-indexed storage system in knowledge management.

[Figure 2: diagram not reproduced. Components labelled: HDFS Client (client JVM on the client node), Distributed FileSystem, SSData InputStream, FSData InputStream, NameNode, Graph Node (Knowledge Graph node), Schema Node (Knowledge Graph Schema node), Schema Node (Data Schema node), Index Node (Data Index node), Data Node (DataSet node). Steps labelled: 1: open, 2: get schema locations, 3: open schema, 4: get block locations, 5: read, 6: read, 7: read, 8: close.]

Fig. 2. Query Process in Multi-Indexed Storage System in Knowledge Management

The HDFS (Hadoop Distributed File System) client first sends an open request to the distributed file system (step 1 in Fig. 2). The request then goes to the name node (step 2 in Fig. 2), which, depending on whether the request opens a dataset or a graph, returns the proper schema node based on the file name or graph name. The HDFS client then sends the query to the schema stream (SSData) (step 3 in Fig. 2). The SSData queries the schema node for the block location of the dataset (step 4 in Fig. 2). For a graph query, the graph schema node follows the graph schema description, finds the data schema node, and then goes to the data index to find the exact data block information. Alternatively, if the query is by graph node id, it finds the block information directly. The data block information is then returned to the HDFS client. Once the HDFS client has the data block information, it sends the query to the file stream (FSData) (step 5 in Fig. 2). The FSData then goes to the exact data blocks to read the data it needs (steps 6 and 7 in Fig. 2).
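
The block-resolution part of this flow (steps 3 and 4) can be sketched in Python under some simplifying assumptions. The dictionaries below stand in for the data schema node, index node, graph schema node and graph node; all names, keys and block ids are hypothetical, and network calls, streams and the B+ tree are deliberately left out. The sketch only shows the routing: dataset queries go through the data schema to the index, graph queries are translated to dataset queries, and queries by graph node id skip the schema.

```python
# Hypothetical stand-ins for the node types; every value is illustrative.

# Data schema node: property -> name of its index, or None when not indexed.
DATA_SCHEMAS = {
    "transactions": {"account_id": "idx_tx_account", "memo": None},
}
# Index node: index name -> key -> blocks holding rows with that key.
INDEXES = {
    "idx_tx_account": {"A17": ["blk-1", "blk-4"]},
}
# All blocks of each dataset, used when a property has no index.
DATASET_BLOCKS = {
    "transactions": ["blk-0", "blk-1", "blk-2", "blk-3", "blk-4"],
}
# Graph schema node: knowledge type -> knowledge property -> (dataset, property).
GRAPH_SCHEMAS = {
    "CashTransaction": {"account": ("transactions", "account_id")},
}
# Graph node: knowledge id -> blocks that its pointers refer to.
GRAPH_NODES = {
    "cash-tx-9": ["blk-4"],
}


def dataset_blocks(dataset, prop, value):
    """Resolve a dataset query to blocks via the data schema and the index."""
    index_name = DATA_SCHEMAS[dataset][prop]
    if index_name is None:
        # No index for this property: fall back to every block of the dataset.
        return list(DATASET_BLOCKS[dataset])
    return list(INDEXES[index_name].get(value, []))


def graph_blocks(knowledge_type, prop=None, value=None, node_id=None):
    """Resolve a graph query to blocks."""
    if node_id is not None:
        # Query by graph node id: follow the node's pointers directly.
        return list(GRAPH_NODES[node_id])
    # Otherwise translate the graph query into a data schema query.
    dataset, data_prop = GRAPH_SCHEMAS[knowledge_type][prop]
    return dataset_blocks(dataset, data_prop, value)
```

With the sample data, `graph_blocks("CashTransaction", prop="account", value="A17")` narrows the read to two blocks, whereas a query on the unindexed `memo` property has to read all five.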

6 Performance Optimization

Distribution issues: The whole multi-indexed storage system is based on the Hadoop distributed file system. We add two new types of node, the schema node and the index node. They act very similarly to the data node; all of them report a heartbeat to the name node. We keep the master/slave architecture for the whole system. Unlike some Lucene-based index systems, we separate the index and schema from the original dataset for better scalability and control. Thus we can simply limit the volume of the index; based on statistical information about queries, we can also dynamically change how much index we need, and we can separate index updates from dataset updates.

Performance of execution: The purpose of adding an index to the system is to reduce the total number of jobs running on the knowledge management system. We try to locate the exact block before the map-reduce job is started. We also try to delay the real loading of the dataset. A lot of processing, such as set join, merge and difference, can be done at the index level [8]. Compared to the non-indexed Hadoop system, we add the work of building the index and of finding the block information in the schema-index system. As described before, building the index is now a separate job and can be done during the off-peak time of the system. Looking up the index adds data loading and processing on the schema node and index values before the map-reduce job is started, which adds traffic on the internal network of the Hadoop system. This is acceptable since the size of the load is small. If the query hits the index, we can eliminate many map jobs.

Performance of building the index: We add a default data schema to every dataset, marked as having no attributes and with all blocks as its index value. If we query an existing data schema that has no index for a particular attribute, we likewise use all the blocks of the dataset as the index value. In this way, the index system works even if there is no index, so we can delay building the index and, based on statistical information, build it dynamically.
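
Two ideas from this section, doing set operations at the index level and treating an unindexed attribute as "all blocks", can be sketched together in Python. This is an illustration under stated assumptions rather than the system's actual planner: the functions `candidate_blocks` and `plan_blocks`, the index layout and the sample block ids are hypothetical, and only intersection for a conjunction of predicates is shown as one instance of index-level set processing.

```python
def candidate_blocks(index, all_blocks, prop, value):
    """Blocks that may hold rows where prop == value.

    An unindexed property behaves like the default schema: every block
    of the dataset is a candidate.
    """
    if prop not in index:
        return set(all_blocks)
    return set(index[prop].get(value, ()))


def plan_blocks(index, all_blocks, predicates):
    """Intersect candidate sets for a conjunction of (prop, value) predicates.

    The intersection happens on block ids only, before any data is loaded.
    """
    blocks = set(all_blocks)
    for prop, value in predicates:
        blocks &= candidate_blocks(index, all_blocks, prop, value)
    return sorted(blocks)


# Hypothetical dataset of eight blocks with two indexed properties.
all_blocks = [f"blk-{i}" for i in range(8)]
index = {
    "account": {"A17": ["blk-1", "blk-4"]},
    "country": {"UK": ["blk-4", "blk-5", "blk-6"]},
}

both_indexed = plan_blocks(index, all_blocks,
                           [("account", "A17"), ("country", "UK")])
one_unindexed = plan_blocks(index, all_blocks,
                            [("account", "A17"), ("channel", "web")])
none_indexed = plan_blocks(index, all_blocks, [("channel", "web")])
```

In this example the fully indexed query needs to read one block, the partly indexed query two, and the unindexed query all eight, which mirrors the claim that an index hit reduces the number of map jobs while a miss still produces a correct, if larger, block list.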

7 Experimental Evaluation

This new application is based on previous work relating to a transaction fraud detection system. In this application, data is obtained from an internal transaction record system, and also from social networks and third-party organizations related to the account holder. We then build a heterogeneous knowledge management system to analyze the historical data and find potential patterns of fraudulent transactions.

In the prototype of this system, we analyzed over 50,000 accounts related to over 2 million transactions, combining the data with data from social network records, IP addresses and addresses. The heterogeneous knowledge analysis was then undertaken for three different node types: Create Account, Cash Transaction and Credit Transaction. We add indexes to the transaction dataset and the social network dataset. The x-axis of Figures 3 and 4 is the number of parallel queries; the y-axis is the execution time in seconds. In Figure 3 only 10% of attributes are indexed; in Figure 4, 80% are indexed.

Fig. 3. The end-to-end execution time for parallelized query (10% indexed)

Fig. 4. The end-to-end execution time for parallelized query (80% indexed)

From the end-to-end execution times for parallelized queries, we can see that in Figure 3, because only 10% of attributes are indexed, the performance of the indexed system is not very different from that of the original Hadoop system. In Figure 4, performance improves substantially when the overall system load is not heavy; the heavier the system load, the less significant the performance gain.

8 Related Work

The integration of full-text indexing within a relational database is not a new idea; Oracle, IBM and Microsoft have done a lot of work along these lines. Jimmy Lin et al. [2] from Twitter present a full-text index to optimize selection operations on text fields within records. M. Cafarella and C. Ré [9] also present a solution that optimizes selection operations in Hadoop programs. Hadoop++ [10] injects trojan indexes into Hadoop input splits at data loading time. Haojun Liao et al. [1] also give an R+ based full-text index solution on the data node.

Some research attempts to bridge relational databases and Map-Reduce programming models. Examples include MapReduceMerge [3], an extension of the original Map-Reduce model that better supports relational operations; HadoopDB [4], an architectural hybrid that integrates Hadoop with PostgreSQL; and Dremel [5], which takes advantage of columnar compression for large-scale data analysis.

Other research adds indexes for graph systems. Interval labeling [11] and 2HOP labeling [12] are typical solutions in this field. Interval labeling approaches use min-post-labeling or pre-post-labeling on a spanning subtree of the DAG. In the 2HOP index, each node determines a set of intermediate nodes it can reach, and a set of intermediate nodes that can reach it.

9 Conclusion and Future Work

In the multi-indexed knowledge management system we propose, we add semantic meaning to the dataset, build a heterogeneous knowledge graph, and add indexes to optimize query performance. In future research, we will focus on performance tuning, such as how to improve the performance of index building and delayed data loading in the storage system; we will also focus on how to use the indexed storage system to optimize the knowledge management and sharing process.

Acknowledgement. This research is supported by NSFC No , the Science and Technology Commission of Shanghai Municipality funding for the Research on Cloud based Data Analysis and Processing in the internet of things (No ). We would like to express our sincere thanks to them.

References

1. Liao, H., Han, J., Fang, J.: Multi-dimensional Index on Hadoop Distributed File System. In: IEEE Fifth International Conference on Networking, Architecture and Storage (NAS) (2010)
2. Lin, J., Ryaboy, D., Weil, K.: Full-text indexing for optimizing selection operations in large-scale data analytics. In: Proceedings of the Second International Workshop on MapReduce and Its Applications (2011)
3. Yang, H., Dasdan, A., Hsiao, R.-L., Parker, D.S.: Map-Reduce-Merge: Simplified relational data processing on large clusters. In: SIGMOD (2007)
4. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In: VLDB (2009)
5. Melnik, S., Gubarev, A., Long, J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T.: Dremel: Interactive analysis of web-scale datasets. In: VLDB (2010)
6. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.: Bigtable: A distributed storage system for structured data. In: OSDI (2006)
7. Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: OSDI (2004)
8. Abadi, D.J.: Materialization strategies in a column-oriented DBMS. In: ICDE, Istanbul, Turkey, pp (2007)
9. Cafarella, M., Ré, C.: Manimal: Relational optimization for data-intensive programs. In: WebDB (2010)
10. Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). In: VLDB (2010)
11. Agrawal, R., Borgida, A., Jagadish, H.V.: Efficient management of transitive relationships in large data and knowledge bases. SIGMOD Rec. 18(2), (1989)
12. Cheng, J., Yu, J.X., Lin, X., Wang, H., Yu, P.S.: Fast computing reachability labelings for large graphs with high compression rate. In: EDBT (2008)


More information

Cassandra- A Distributed Database

Cassandra- A Distributed Database Cassandra- A Distributed Database Tulika Gupta Department of Information Technology Poornima Institute of Engineering and Technology Jaipur, Rajasthan, India Abstract- A relational database is a traditional

More information

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Nov 01 09:53:32 2012 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2012 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Combining MapReduce with Parallel DBMS Techniques for Large-Scale Data Analytics

Combining MapReduce with Parallel DBMS Techniques for Large-Scale Data Analytics EDIC RESEARCH PROPOSAL 1 Combining MapReduce with Parallel DBMS Techniques for Large-Scale Data Analytics Ioannis Klonatos DATA, I&C, EPFL Abstract High scalability is becoming an essential requirement

More information

50 Must Read Hadoop Interview Questions & Answers

50 Must Read Hadoop Interview Questions & Answers 50 Must Read Hadoop Interview Questions & Answers Whizlabs Dec 29th, 2017 Big Data Are you planning to land a job with big data and data analytics? Are you worried about cracking the Hadoop job interview?

More information
