Multi-indexed Graph Based Knowledge Storage System
Hongming Zhu 1,2, Danny Morton 2, Wenjun Zhou 3, Qin Liu 1, and You Zhou 1

1 School of Software Engineering, Tongji University, China {zhu_hongming,qin.liu}@tongji.edu.cn, zyrusher@gmail.com
2 Well-Being and Social Sciences, Bolton University, UK dm3@bolton.ac.uk
3 School of Computer Science, Tongji University, China zhouwenjun77@hotmail.com

Abstract. With the rapid development of information technologies, knowledge management systems face the problem of how to manage a massive volume of data and make efficient use of it. In this paper, we analyze the current challenges of knowledge management systems and propose a multi-indexed graph based knowledge storage system that avoids data duplication and is optimized for parallel processing.

Keywords: Index, Storage System, Knowledge Management.

1 Introduction

With the development of information technologies, access to datasets from many different sources is an important part of any knowledge management process. It is still very difficult, however, to extract the exact meaning, or related attributes, of the knowledge within a dataset. This paper explores a framework for the management, coding, and analysis of large heterogeneous datasets.

In any knowledge management system, the datasets usually come from different sources with different formats and varying degrees of structure. Some of the datasets are well structured, but there is also a lot of unstructured data arising from sources such as wiki data, web data and social network data. Because multiple semantic meanings can be applied to the same dataset, different applications may interpret or use the dataset in different ways. The objective of the research described in this paper was to design a data storage framework that would help overcome these problems in knowledge management systems.
Z. Huang et al. (Eds.): WISE 2013 Workshops 2013, LNCS 8182, pp , Springer-Verlag Berlin Heidelberg 2014

2 Motivation

2.1 The Knowledge Management System Should Be Scalable and Optimized for Massive Dataset Analysis

With the massive explosion of data in knowledge management systems, we have to handle large volumes of data during analysis. How to avoid data duplication across different knowledge analysis projects, and how to optimize for parallel processing of the data, become the key challenges in a modern knowledge management system. A unified, scalable and parallel storage architecture is needed.

2.2 Structured, Semi-structured and Unstructured Data Can Be Managed in the Same Way

In most knowledge management systems, the datasets come from individuals or organizations. Some of these datasets are well structured according to the requirements of the originator, but they are not necessarily suitable for other analysis and processing built around a different requirement. Indeed, some datasets are semi-structured or even unstructured, such as web data and social network datasets. The key is to design a storage and access system that works in the same way for all types of dataset, irrespective of their source and format.

2.3 Semantic Description Can Be Easily Added in the Dataset

In any knowledge management system, it is very difficult to define a proper schema for a dataset and never change it. Different applications may focus on different schemas for the same dataset, and we may need to add more semantic meanings to the dataset during the process of analysis. For example, when dealing with a customer referral system, we may regard the address as a whole sentence, but when dealing with a logistical application we may need to divide the address into countries, states, streets, etc., in order to quantify the data. Within the storage system, one dataset may therefore have several schemas depending on the application.

2.4 Semantic Meanings of Knowledge Can Be Easily Added in the Graph Based Knowledge Management System

In a knowledge management system, we will often have different definitions for the knowledge.
Knowledge is usually defined by certain properties, and there may also be linkages between different parts of the knowledge dataset. In order to describe the different semantic meanings of knowledge, we regard a particular kind of knowledge as a knowledge node type, and atomic knowledge as an instance of a knowledge node type. For example, technical publications in a particular field can be regarded as a knowledge node type; a single publication in this field is an instance of that node type. An edge is defined as the linkage between knowledge nodes. For example, an edge may link papers written by the same author. Thus we can build a heterogeneous graph for knowledge management problems. We introduce the idea of a knowledge schema to describe the heterogeneous graph in the knowledge management system.

2.5 The Storage System Can Be Scalable and Have Good Performance in Knowledge Processing

With an increasing number of datasets within a knowledge management system, we have to think about the scalability, performance, and the processing of the data.
Distributed storage and processing is a potential solution for this problem. In order to get better performance, we need to add indexes to the distributed storage and processing system. The index needs to scale easily. We can add and update an index based on the graph schema, the dataset schema, and the dataset. The index also needs to be processed in parallel for better performance. We introduce the ideas of indexing and delayed loading of a dataset into the knowledge management system.

3 Design Overview

The distributed storage system we propose is based on the Google distributed file system GFS [6]. It contains four parts: data nodes, which contain the dataset; data schema nodes and graph knowledge schema nodes, which contain the dataset schema and graph schema; index nodes, which contain the index for the dataset; and graph knowledge nodes, which contain the knowledge graph. Figure 1 shows the architecture of the proposed multi-indexed knowledge management storage system.

Fig. 1. The Logical Architecture of Multi-Indexed Storage System in Knowledge Management (diagram not reproduced; it shows read and write clients, the NameNode with its metadata, graph nodes, schema nodes, index nodes, and replicated data nodes across two racks)

The dataset is stored as pure row data in the distributed file system based on Hadoop. Hadoop [7] is an open source software framework that supports data intensive distributed applications, is designed for big data processing, and supports the running of applications on large clusters of commodity hardware. We add index features to the Hadoop system to build the knowledge management system. We regard the dataset, whether it is structured or unstructured, as files, which are stored in the Hadoop system as row files. During processing, we also try to delay the real loading of the dataset for better performance.

The data schema is the description of the dataset and the metadata of the data index. One dataset may have more than one data schema. Each schema contains the self-described dataset structure and the metadata of the index. Since each row of data in the dataset has its own semantic meaning, the self-described dataset structure presents the position-sensitive or separator-sensitive meaning of the row data. The metadata of the index includes the availability of the index for each semantic section of the dataset, and the name, replicas, and position of the index block. With the metadata of the index, we can easily find where the index block is. The data schema can be added when we import or first load the dataset, but we can also add more schemas after that.

The data index gives the metadata of the data blocks. With the index system, we can locate the exact position of the data blocks, so we start only the minimal number of jobs needed to read the dataset, which reduces the total number of running jobs in the distributed storage and processing system. The data index can be built and updated dynamically. Based on the availability of resources and the statistics of the queries, we can choose how many indexes we need and update the index during the off-peak time of the system.

The knowledge graph schema is very similar to the data schema. It is the description of the heterogeneous knowledge graph and the metadata of the knowledge index. The information of the knowledge graph has three parts: node type, node and edge.
Each category of knowledge is described as a node type; an instance of knowledge is regarded as a knowledge node; and an edge is the linkage between two knowledge instances. The metadata of the knowledge index indicates whether the knowledge is indexed or not.

The knowledge graph node stores the knowledge graph. We regard knowledge as a heterogeneous graph; the knowledge graph node contains the atomic knowledge items and the linkages between them. In order to avoid duplication of the dataset, a knowledge node contains only one or several pointers, which point to the dataset.

4 Details of Design

4.1 Name Node

The name node in our multi-indexed distributed knowledge management storage system is nearly the same as the GFS name node. In GFS, the name node contains the metadata of the data nodes, but in our system the index node contains the block metadata and a hash table that maps the logical block metadata to the physical metadata. This means that in the event of hardware failure we do not need to update the index node.
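The indirection described in Section 4.1 can be illustrated with a minimal sketch. It assumes one possible reading of the text: index entries refer only to logical block ids, and a separate hash table resolves each logical id to its physical replicas, so a hardware failure changes the table but not the index entries. The names `PhysicalLocation`, `BlockTable`, `relocate`, and the sample node names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PhysicalLocation:
    """Where one replica of a block currently lives (hypothetical fields)."""
    data_node: str
    offset: int


@dataclass
class BlockTable:
    """Hash table from logical block id to the physical replicas of that block."""
    replicas: dict[str, list[PhysicalLocation]] = field(default_factory=dict)

    def register(self, logical_id: str, location: PhysicalLocation) -> None:
        self.replicas.setdefault(logical_id, []).append(location)

    def relocate(self, failed_node: str, spare_node: str) -> None:
        """Re-point every replica on a failed data node to a spare node."""
        for locations in self.replicas.values():
            for i, loc in enumerate(locations):
                if loc.data_node == failed_node:
                    locations[i] = PhysicalLocation(spare_node, loc.offset)

    def resolve(self, logical_id: str) -> list[PhysicalLocation]:
        return list(self.replicas.get(logical_id, []))


# Index entries hold logical block ids only (assumption of this sketch),
# so they stay valid when a block moves to another machine.
index = {"city=Shanghai": ["blk-1", "blk-7"]}

table = BlockTable()
table.register("blk-1", PhysicalLocation("datanode-a", 0))
table.register("blk-7", PhysicalLocation("datanode-b", 4096))

# Simulated hardware failure: only the hash table is updated.
table.relocate(failed_node="datanode-a", spare_node="datanode-c")
locations = [table.resolve(block) for block in index["city=Shanghai"]]
```

After the simulated failure, `index` is unchanged while `locations` reflects the new placement, which is the property the section relies on.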
4.2 Data Schema Node and Data Index Node

The data schema contains the semantic description of the dataset. One dataset can have more than one schema. The schema shows what kinds of properties are in the dataset, the metadata of the properties, the availability of an index for each property, and the metadata of the index for each property. A typical data schema file contains six parts: dataset name, dataset description, property names, property descriptions, property separation metadata and property index metadata. The property separation metadata supports separators, aliases, prefixes and postfixes, and the combination and iteration of the three. For the property index metadata, if the index is available, it points to the data index block; if it is not available, it points to the dataset block.

The data index contains the index information for each property; the index is a full-text index on the property. The value stored for each key in the full-text index is the set of data blocks that contain the key. The data index is stored as a B+ tree for better search performance.

4.3 Knowledge Graph Schema Node

Nowadays, knowledge is interconnected and interacts, forming numerous large, interconnected and sophisticated networks. We regard it as a heterogeneous graph. In our knowledge storage system, we also add the graph schema and graph index to the system. The knowledge graph schema contains the semantic description of the knowledge. It includes the data schema used to support the knowledge, the properties of the knowledge, the linkage between pieces of knowledge, and the linkage between knowledge properties and dataset properties. A typical knowledge schema file contains eight parts: knowledge names, knowledge description, property names, property description, property metadata, linkage name, linkage description and linkage metadata.
The property metadata contains the linkage between the knowledge property and the data schema property. The linkage metadata states the condition under which two nodes have an edge between them and the direction of that edge.

4.4 Knowledge Graph Node

The knowledge graph node contains information about the knowledge schema, the knowledge itself and the linkage between them. In order to avoid duplication of the dataset, we store only the knowledge id, the node/edge type and a series of pointers, which point to the related row data in each related dataset. There is no data duplication in the knowledge graph system. If there is a query by graph id, we can go to the dataset directly; if not, we translate the query into a series of data schema queries based on the knowledge graph schema. Since most knowledge graphs are very sparse, we use an adjacency list to store the knowledge graph.
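As an illustration of Section 4.4, the following minimal sketch stores a sparse heterogeneous graph as an adjacency list in which each node carries only an id, a type, and pointers into datasets. The class and field names (`RowPointer`, `KnowledgeNode`, `KnowledgeGraph`, `block_id`, `row`) are assumptions made for the example; the paper does not specify a concrete layout.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RowPointer:
    """Reference to one row of a dataset; no row content is copied."""
    dataset: str
    block_id: str
    row: int


@dataclass(frozen=True)
class KnowledgeNode:
    node_id: str
    node_type: str  # e.g. "Publication"
    pointers: tuple[RowPointer, ...]


class KnowledgeGraph:
    """Sparse heterogeneous graph stored as an adjacency list."""

    def __init__(self) -> None:
        self.nodes: dict[str, KnowledgeNode] = {}
        # node_id -> list of (edge_type, neighbour node_id)
        self.adjacency: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def add_node(self, node: KnowledgeNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, edge_type: str, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.adjacency[source].append((edge_type, target))

    def neighbours(self, node_id: str, edge_type: str) -> list[str]:
        return [t for e, t in self.adjacency.get(node_id, []) if e == edge_type]

    def pointers_for(self, node_id: str) -> tuple[RowPointer, ...]:
        """Query by graph id: return the dataset locations directly."""
        return self.nodes[node_id].pointers


graph = KnowledgeGraph()
graph.add_node(KnowledgeNode("p1", "Publication", (RowPointer("papers", "blk-3", 12),)))
graph.add_node(KnowledgeNode("p2", "Publication", (RowPointer("papers", "blk-9", 4),)))
graph.add_edge("p1", "same_author", "p2")
```

Because a node holds only pointers, the row content stays in the dataset blocks and is read only when a query actually needs it.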
5 Query Process Analysis

Figure 2 shows a query process in the multi-indexed storage system in knowledge management.

Fig. 2. Query Process in Multi-Indexed Storage System in Knowledge Management (diagram not reproduced; it shows the HDFS client, the distributed file system, the schema stream SSData, the file stream FSData, the name node, and the graph, schema, index and data nodes, with the steps 1: open, 2: get schema locations, 3: open schema, 4: get block locations, 5: read, 6: read, 7: read, 8: close)

The HDFS (Hadoop Distributed File System) client first sends an open request to the distributed file system (step 1 in fig. 2), and the request then goes to the name node (step 2 in fig. 2). Depending on whether the request opens a dataset or a graph, the name node returns the proper schema node based on the file name or graph name. The HDFS client then sends the query to the schema stream (SSData) (step 3 in fig. 2). The SSData queries the schema node for the block location of the dataset (step 4 in fig. 2). If it is a query for a graph, the graph schema node follows the graph schema description, finds the data schema node, and then goes to the data index to find the exact data block information. Alternatively, if it is a query by graph node id, it finds the block information directly. The data block information is then returned to the HDFS client. After the HDFS client gets the data block information, it sends the query to the file stream (FSData) (step 5 in fig. 2). The FSData then goes to the exact data blocks to read all of the data it needs (steps 6 and 7 in fig. 2).
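The block resolution performed in step 4 can be sketched as follows. This is an illustrative in-memory model, not the system's implementation: the dictionaries stand in for the graph schema node, data schema node, data index node and graph node, and every name and sample value in them is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-ins for the schema, index, and graph nodes.
GRAPH_SCHEMA = {
    # (knowledge node type, knowledge property) -> (data schema, dataset property)
    ("Publication", "author"): ("papers_schema", "author_name"),
}
DATA_SCHEMA = {
    # (data schema, dataset property) -> index name, or None if unindexed
    ("papers_schema", "author_name"): "papers.author_name.idx",
}
DATA_INDEX = {
    # index name -> key -> block ids holding rows with that key
    "papers.author_name.idx": {"Zhu": ["blk-3", "blk-9"], "Liu": ["blk-9"]},
}
GRAPH_NODE_POINTERS = {
    # graph node id -> block ids its pointers refer to
    "p1": ["blk-3"],
}


@dataclass(frozen=True)
class GraphQuery:
    node_type: str = ""
    prop: str = ""
    value: str = ""
    node_id: str = ""


def resolve_blocks(query: GraphQuery) -> list[str]:
    """Return the data blocks a client must read for a graph query."""
    if query.node_id:
        # Query by graph node id: block information is found directly.
        return list(GRAPH_NODE_POINTERS.get(query.node_id, []))
    # Otherwise follow graph schema -> data schema -> data index.
    schema_name, dataset_prop = GRAPH_SCHEMA[(query.node_type, query.prop)]
    index_name = DATA_SCHEMA[(schema_name, dataset_prop)]
    if index_name is None:
        raise LookupError("property is not indexed in this sketch")
    return list(DATA_INDEX[index_name].get(query.value, []))


by_id = resolve_blocks(GraphQuery(node_id="p1"))
by_property = resolve_blocks(GraphQuery("Publication", "author", "Zhu"))
```

Only the returned block ids would then be handed to the file stream, so the dataset itself is not touched until steps 5 to 7.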
6 Performance Optimization

Distribution issues: The whole multi-indexed storage system is based on the Hadoop distributed file system. We add two new types of node, the schema node and the index node. They act very similarly to the data node; they all need to report a heartbeat to the name node. We still keep the master/slave architecture for the whole system. Unlike some Lucene based index systems, we separate the index and schema from the original dataset for better scalability and control. Thus we can simply limit the volume of the index. Based on the statistical information of the queries, we can also dynamically change how much index we need, and separate the update of the index from the update of the dataset.

Performance of execution: The purpose of adding an index to the system is to reduce the total number of jobs running on the knowledge management system. We try to locate the exact block before the map-reduce job is started. We also try to delay the real loading of the dataset. A lot of processing, such as set join, merge and difference, can be done at the index level [8]. Compared to the non-indexed Hadoop system, we add the work of building the index and of finding the block information in the schema-index system. As described before, building the index is now a separate job and can be done during the off-peak time of the system. Looking up the index adds data loading and processing from the schema node and index values before map-reduce is started, which increases the traffic on the internal network of the Hadoop system. This is acceptable since the size of the load is small. If the query hits the index, we can avoid a lot of jobs for map processing.

Performance of building the index: We add a default data schema to every dataset, which is marked as having no attributes and has all the blocks as its index value. If we query an existing data schema that has no index for a particular attribute, we also use all the blocks of the dataset as its value. In this way the index system works even if no index has been built. So we can delay the building of the index and, based on the statistical information, build the index dynamically.
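Two ideas from this section can be shown together in a short sketch: the fallback to all blocks when a property has no index, and combining conditions at the index level before any block is read. The data, the function names `lookup` and `blocks_for_and`, and the use of plain dictionaries in place of the B+ tree are assumptions made for illustration.

```python
ALL_BLOCKS = {"transactions": {"blk-1", "blk-2", "blk-3", "blk-4"}}

# dataset -> property -> key -> blocks; only "account" is indexed in this toy example.
INDEXES = {
    "transactions": {
        "account": {"A17": {"blk-1", "blk-3"}, "B42": {"blk-2"}},
    }
}


def lookup(dataset: str, prop: str, key: str) -> set[str]:
    """Blocks that may contain rows where `prop` equals `key`."""
    prop_index = INDEXES.get(dataset, {}).get(prop)
    if prop_index is None:
        # No index for this property: every block is a candidate,
        # mirroring the default schema whose index value is all blocks.
        return set(ALL_BLOCKS[dataset])
    return set(prop_index.get(key, set()))


def blocks_for_and(dataset: str, conditions: dict[str, str]) -> set[str]:
    """Intersect candidate blocks for a conjunction of equality conditions."""
    candidates = set(ALL_BLOCKS[dataset])
    for prop, key in conditions.items():
        candidates &= lookup(dataset, prop, key)
    return candidates


indexed_hit = lookup("transactions", "account", "A17")
unindexed = lookup("transactions", "country", "CN")
combined = blocks_for_and("transactions", {"account": "A17", "country": "CN"})
```

The intersection yields a superset of the blocks that actually hold matching rows, so it is safe for pruning: a map job is started only for blocks that survive it.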
7 Experimental Evaluation

This new application is based on previous work relating to a transaction fraud detection system. In this application, data is obtained from an internal transaction record system; data is also obtained from social networks and third party organizations that are related to the account holder. We then build a heterogeneous knowledge management system in order to analyze the historical data and find potential patterns for fraudulent transactions.

In the prototype of this system, we analyzed over 50,000 accounts related to over 2 million transactions, combining the data with data from social network records, IP addresses and addresses. The heterogeneous knowledge analysis was then undertaken for three different node types: Create Account, Cash Transaction and Credit Transaction. We add indexes to the transaction dataset and the social network dataset. The x-axis of figure 3 and figure 4 is the number of parallel queries; the y-axis is the execution time in seconds. In figure 3, only 10% of the attributes are indexed; in figure 4, 80% are indexed.
Fig. 3. The end-end execution time for parallelized query (10% indexed)

Fig. 4. The end-end execution time for parallelized query (80% indexed)

From the end-end execution time for parallelized queries, we can see that in figure 3, because only 10% of the attributes are indexed, the performance of the indexed system is not very different from that of the original Hadoop system. In figure 4, performance improves substantially when the overall system load is not heavy; under heavier load, the performance gain is less significant.

8 Related Work

The integration of full-text indexing within a relational database is not a new idea; Oracle, IBM and Microsoft have done a lot of work along these lines. Jimmy Lin et al. [2] from Twitter give a full-text index to optimize selection operations on text fields within records. M. Cafarella and C. Ré also propose a solution that optimizes selection operations in Hadoop programs [9]. Hadoop++ [10] also injects trojan indexes into Hadoop input splits at data loading time. Haojun Liao et al. [1] also give an R+ based full-text index solution on the data node.

Some research attempts to bridge relational databases and Map-Reduce programming models. Examples include an extension of the original Map-Reduce model called MapReduceMerge [3] to better support relational operations; HadoopDB [4], an architectural hybrid that integrates Hadoop with PostgreSQL; and Dremel [5], which takes advantage of columnar compression for large-scale data analysis.

Other research adds indexes to graph systems. Interval labeling [11] and 2HOP labeling [12] are the typical solutions in this field. The interval labeling approaches use min-post-labeling or pre-post-labeling on a spanning subtree of the DAG. In the 2HOP index, each node determines a set of intermediate nodes it can reach, and a set of intermediate nodes that can reach it.

9 Conclusion and Future Work

In the multi-indexed knowledge management system we propose, we add semantic meaning to the dataset, build a heterogeneous knowledge graph, and add indexes to optimize query performance.
In future research, we will focus on performance tuning, such as how to improve the performance of index building and delayed data loading in the storage system; we will also focus on how to use the indexed storage system to optimize the knowledge management and sharing process.

Acknowledgement. This research is supported by NSFC No , the Science and Technology Commission of Shanghai Municipality funding for the Research on Cloud based Data Analysis and Processing in the internet of things (No ). We would like to express our sincere thanks to them.
References

1. Liao, H., Han, J., Fang, J.: Multi-dimensional Index on Hadoop Distributed File System. In: IEEE Fifth International Conference on Networking, Architecture and Storage (NAS) (2010)
2. Lin, J., Ryaboy, D., Weil, K.: Full-text indexing for optimizing selection operations in large-scale data analytics. In: Proceedings of the Second International Workshop on MapReduce and Its Applications (2011)
3. Yang, H., Dasdan, A., Hsiao, R.-L., Parker, D.S.: Map-Reduce-Merge: Simplified relational data processing on large clusters. In: SIGMOD (2007)
4. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In: VLDB (2009)
5. Melnik, S., Gubarev, A., Long, J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T.: Dremel: Interactive analysis of web-scale datasets. In: VLDB (2010)
6. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.: Bigtable: A distributed storage system for structured data. In: OSDI (2006)
7. Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: OSDI (2004)
8. Abadi, D.J.: Materialization strategies in a Column oriented DBMS, ICDE, Istanbul, Turkey, pp (2007)
9. Cafarella, M., Ré, C.: Manimal: Relational optimization for data-intensive programs. In: WebDB (2010)
10. Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). In: VLDB (2010)
11. Agrawal, R., Borgida, A., Jagadish, H.V.: Efficient management of transitive relationships in large data and knowledge bases. SIGMOD Rec. 18(2), (1989)
12. Cheng, J., Yu, J.X., Lin, X., Wang, H., Yu, P.S.: Fast computing reachability labelings for large graphs with high compression rate. In: EDBT (2008)
EQUI-DEPTH HISTOGRAM CONSTRUCTION FOR BIG DATA WITH QUALITY GUARANTEES 1 Equi-depth Histogram Construction for Big Data with Quality Guarantees Burak Yıldız, Tolga Büyüktanır, and Fatih Emekci arxiv:166.5633v1
More informationLITERATURE SURVEY (BIG DATA ANALYTICS)!
LITERATURE SURVEY (BIG DATA ANALYTICS) Applications frequently require more resources than are available on an inexpensive machine. Many organizations find themselves with business processes that no longer
More informationAn Efficient Distributed B-tree Index Method in Cloud Computing
Send Orders for Reprints to reprints@benthamscience.ae The Open Cybernetics & Systemics Journal, 214, 8, 32-38 32 Open Access An Efficient Distributed B-tree Index Method in Cloud Computing Huang Bin 1,*
More informationHuge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2
2nd International Conference on Materials Science, Machinery and Energy Engineering (MSMEE 2017) Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 1 Information Engineering
More informationarxiv: v1 [cs.db] 2 Oct 2018
Heterogeneous Replica for Query on Cassandra Jialin Qiao, Xiangdong Huang, Lei Rui, Jianmin Wang Tsinghua University Beijing, China qjl16,ruil14@mails.tsinghua.edu.cn,huangxdong,jimwang@tsinghua.edu.cn
More informationHadoop and HDFS Overview. Madhu Ankam
Hadoop and HDFS Overview Madhu Ankam Why Hadoop We are gathering more data than ever Examples of data : Server logs Web logs Financial transactions Analytics Emails and text messages Social media like
More informationMapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1
MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.
More informationBigData and Map Reduce VITMAC03
BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to
More informationGlobal Journal of Engineering Science and Research Management
A FUNDAMENTAL CONCEPT OF MAPREDUCE WITH MASSIVE FILES DATASET IN BIG DATA USING HADOOP PSEUDO-DISTRIBUTION MODE K. Srikanth*, P. Venkateswarlu, Ashok Suragala * Department of Information Technology, JNTUK-UCEV
More informationMapReduce. U of Toronto, 2014
MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in
More informationColumn Stores and HBase. Rui LIU, Maksim Hrytsenia
Column Stores and HBase Rui LIU, Maksim Hrytsenia December 2017 Contents 1 Hadoop 2 1.1 Creation................................ 2 2 HBase 3 2.1 Column Store Database....................... 3 2.2 HBase
More informationA Review to the Approach for Transformation of Data from MySQL to NoSQL
A Review to the Approach for Transformation of Data from MySQL to NoSQL Monika 1 and Ashok 2 1 M. Tech. Scholar, Department of Computer Science and Engineering, BITS College of Engineering, Bhiwani, Haryana
More informationA Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop
A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop Myoungjin Kim 1, Seungho Han 1, Jongjin Jung 3, Hanku Lee 1,2,*, Okkyung Choi 2 1 Department of Internet and Multimedia Engineering,
More informationSurvey Paper on Traditional Hadoop and Pipelined Map Reduce
International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,
More informationData Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros
Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on
More informationEmbedded Technosolutions
Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication
More informationGraph Algorithms using Map-Reduce. Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web
Graph Algorithms using Map-Reduce Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web Graph Algorithms using Map-Reduce Graphs are ubiquitous in modern society. Some
More informationFast and Effective System for Name Entity Recognition on Big Data
International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-3, Issue-2 E-ISSN: 2347-2693 Fast and Effective System for Name Entity Recognition on Big Data Jigyasa Nigam
More informationAdapting Skyline Computation to the MapReduce Framework: Algorithms and Experiments
Adapting Skyline Computation to the MapReduce Framework: Algorithms and Experiments Boliang Zhang 1,ShuigengZhou 1, and Jihong Guan 2 1 School of Computer Science, and Shanghai Key Lab of Intelligent Information
More informationGoogle File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information
Subject 10 Fall 2015 Google File System and BigTable and tiny bits of HDFS (Hadoop File System) and Chubby Not in textbook; additional information Disclaimer: These abbreviated notes DO NOT substitute
More informationOptimizing Hadoop Block Placement Policy & Cluster Blocks Distribution
Vol:6, No:1, 212 Optimizing Hadoop Block Placement Policy & Cluster Blocks Distribution Nchimbi Edward Pius, Liu Qin, Fion Yang, Zhu Hong Ming International Science Index, Computer and Information Engineering
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationInternational Journal of Advance Engineering and Research Development. A Study: Hadoop Framework
Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja
More informationBig Data with Hadoop Ecosystem
Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process
More informationResearch on the Application of Bank Transaction Data Stream Storage based on HBase Xiaoguo Wang*, Yuxiang Liu and Lin Zhang
International Conference on Engineering Management (Iconf-EM 2016) Research on the Application of Bank Transaction Data Stream Storage based on HBase Xiaoguo Wang*, Yuxiang Liu and Lin Zhang School of
More informationPSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets
2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department
More informationDremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis Presented by: Sameer Agarwal sameerag@cs.berkeley.edu
More informationQuery processing on raw files. Vítor Uwe Reus
Query processing on raw files Vítor Uwe Reus Outline 1. Introduction 2. Adaptive Indexing 3. Hybrid MapReduce 4. NoDB 5. Summary Outline 1. Introduction 2. Adaptive Indexing 3. Hybrid MapReduce 4. NoDB
More informationAndrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09 Presented by: Daniel Isaacs It all starts with cluster computing. MapReduce Why
More informationA computational model for MapReduce job flow
A computational model for MapReduce job flow Tommaso Di Noia, Marina Mongiello, Eugenio Di Sciascio Dipartimento di Ingegneria Elettrica e Dell informazione Politecnico di Bari Via E. Orabona, 4 70125
More informationData Prefetching for Scientific Workflow Based on Hadoop
Data Prefetching for Scientific Workflow Based on Hadoop Gaozhao Chen, Shaochun Wu, Rongrong Gu, Yongquan Xu, Lingyu Xu, Yunwen Ge, and Cuicui Song * Abstract. Data-intensive scientific workflow based
More informationWhere We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)
More informationThings To Know. When Buying for an! Alekh Jindal, Jorge Quiané, Jens Dittrich
7 Things To Know When Buying for an! Alekh Jindal, Jorge Quiané, Jens Dittrich 1 What Shoes? Why Shoes? 3 Analyzing MR Jobs (HadoopToSQL, Manimal) Generating MR Jobs (PigLatin, Hive) Executing MR Jobs
More informationAn Improved Performance Evaluation on Large-Scale Data using MapReduce Technique
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,
More informationGhislain Fourny. Big Data 5. Column stores
Ghislain Fourny Big Data 5. Column stores 1 Introduction 2 Relational model 3 Relational model Schema 4 Issues with relational databases (RDBMS) Small scale Single machine 5 Can we fix a RDBMS? Scale up
More informationBigTable: A Distributed Storage System for Structured Data
BigTable: A Distributed Storage System for Structured Data Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) BigTable 1393/7/26
More informationMI-PDB, MIE-PDB: Advanced Database Systems
MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:
More informationResearch Article Mobile Storage and Search Engine of Information Oriented to Food Cloud
Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 DOI:10.19026/ajfst.5.3106 ISSN: 2042-4868; e-issn: 2042-4876 2013 Maxwell Scientific Publication Corp. Submitted: May 29, 2013 Accepted:
More informationResearch and Improvement of Apriori Algorithm Based on Hadoop
Research and Improvement of Apriori Algorithm Based on Hadoop Gao Pengfei a, Wang Jianguo b and Liu Pengcheng c School of Computer Science and Engineering Xi'an Technological University Xi'an, 710021,
More informationCLIENT DATA NODE NAME NODE
Volume 6, Issue 12, December 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Efficiency
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationCPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University
CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network
More informationLessons Learned While Building Infrastructure Software at Google
Lessons Learned While Building Infrastructure Software at Google Jeff Dean jeff@google.com Google Circa 1997 (google.stanford.edu) Corkboards (1999) Google Data Center (2000) Google Data Center (2000)
More informationJournal of East China Normal University (Natural Science) Data calculation and performance optimization of dairy traceability based on Hadoop/Hive
4 2018 7 ( ) Journal of East China Normal University (Natural Science) No. 4 Jul. 2018 : 1000-5641(2018)04-0099-10 Hadoop/Hive 1, 1, 1, 1,2, 1, 1 (1., 210095; 2., 210095) :,, Hadoop/Hive, Hadoop/Hive.,,
More informationBig Data Hadoop Stack
Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware
More informationA New Model of Search Engine based on Cloud Computing
A New Model of Search Engine based on Cloud Computing DING Jian-li 1,2, YANG Bo 1 1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China 2. Tianjin Key
More informationData Sharing Made Easier through Programmable Metadata. University of Wisconsin-Madison
Data Sharing Made Easier through Programmable Metadata Zhe Zhang IBM Research! Remzi Arpaci-Dusseau University of Wisconsin-Madison How do applications share data today? Syncing data between storage systems:
More informationThe Analysis Research of Hierarchical Storage System Based on Hadoop Framework Yan LIU 1, a, Tianjian ZHENG 1, Mingjiang LI 1, Jinpeng YUAN 1
International Conference on Intelligent Systems Research and Mechatronics Engineering (ISRME 2015) The Analysis Research of Hierarchical Storage System Based on Hadoop Framework Yan LIU 1, a, Tianjian
More informationDynamic Data Placement Strategy in MapReduce-styled Data Processing Platform Hua-Ci WANG 1,a,*, Cai CHEN 2,b,*, Yi LIANG 3,c
2016 Joint International Conference on Service Science, Management and Engineering (SSME 2016) and International Conference on Information Science and Technology (IST 2016) ISBN: 978-1-60595-379-3 Dynamic
More informationHadoop/MapReduce Computing Paradigm
Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications
More informationThe Design of Distributed File System Based on HDFS Yannan Wang 1, a, Shudong Zhang 2, b, Hui Liu 3, c
Applied Mechanics and Materials Online: 2013-09-27 ISSN: 1662-7482, Vols. 423-426, pp 2733-2736 doi:10.4028/www.scientific.net/amm.423-426.2733 2013 Trans Tech Publications, Switzerland The Design of Distributed
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationDistributed File Systems II
Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation
More informationEvaluation of Multiple Fat-Btrees on a Parallel Database
DEIM Forum 2012 D10-3 Evaluation of Multiple Fat-Btrees on a Parallel Database Min LUO Takeshi MISHIMA Haruo YOKOTA Dept. of Computer Science, Tokyo Institute of Technology 2-12-1 Ookayama, Meguro-ku,
More informationA Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud
Calhoun: The NPS Institutional Archive Faculty and Researcher Publications Faculty and Researcher Publications 2013-03 A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the
More informationABSTRACT I. INTRODUCTION
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISS: 2456-3307 Hadoop Periodic Jobs Using Data Blocks to Achieve
More informationSQL Query Optimization on Cross Nodes for Distributed System
2016 International Conference on Power, Energy Engineering and Management (PEEM 2016) ISBN: 978-1-60595-324-3 SQL Query Optimization on Cross Nodes for Distributed System Feng ZHAO 1, Qiao SUN 1, Yan-bin
More informationAN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang
International Journal of Innovative Computing, Information and Control ICIC International c 2017 ISSN 1349-4198 Volume 13, Number 3, June 2017 pp. 1037 1046 AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA
More informationA Review Paper on Big data & Hadoop
A Review Paper on Big data & Hadoop Rupali Jagadale MCA Department, Modern College of Engg. Modern College of Engginering Pune,India rupalijagadale02@gmail.com Pratibha Adkar MCA Department, Modern College
More informationAnnouncements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414
Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s
More informationHadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017
Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google
More information