Multi-indexed Graph Based Knowledge Storage System


Hongming Zhu 1,2, Danny Morton 2, Wenjun Zhou 3, Qin Liu 1, and You Zhou 1

1 School of Software Engineering, Tongji University, China
{zhu_hongming,qin.liu}@tongji.edu.cn, zyrusher@gmail.com
2 Well-Being and Social Sciences, Bolton University, UK
dm3@bolton.ac.uk
3 School of Computer Science, Tongji University, China
zhouwenjun77@hotmail.com

Abstract. With the rapid development of information technologies, knowledge management systems face the problem of how to manage massive volumes of data and how to use that data efficiently. In this paper, we analyze the current challenges of knowledge management systems and propose a multi-indexed, graph-based knowledge storage system that avoids data duplication and is optimized for parallel processing.

Keywords: Index, Storage System, Knowledge Management.

1 Introduction

With the development of information technologies, access to datasets from many different sources is an important part of any knowledge management process. It is still very difficult, however, to extract the exact meaning, or the related attributes, of the knowledge within a dataset. This paper explores a framework for the management, coding, and analysis of large heterogeneous datasets.

In any knowledge management system, datasets usually come from different sources, in different formats, and with varying degrees of structure. Some datasets are well structured, but there is also a lot of unstructured data arising from sources such as wiki data, web data and social network data. Because multiple semantic meanings can be applied to the same dataset, different applications may interpret or use the dataset in different ways. The objective of the research described in this paper was to design a data storage framework that helps overcome these problems in knowledge management systems.
Z. Huang et al. (Eds.): WISE 2013 Workshops 2013, LNCS 8182, pp , Springer-Verlag Berlin Heidelberg 2014

2 Motivation

2.1 The Knowledge Management System Should Be Scalable and Optimized for Massive Dataset Analysis

With the massive explosion of data in knowledge management systems, we have to handle large volumes of data during analysis. How to avoid duplicating data across different knowledge analysis projects, and how to optimize for parallel processing of the data, have become the key challenges in modern knowledge management systems. A unified, scalable and parallel storage architecture is needed.

2.2 Structured, Semi-structured and Unstructured Data Can Be Managed in the Same Way

In most knowledge management systems, the datasets come from individuals or organizations. Some of these datasets are well structured according to the requirements of the originator, but that structure is not necessarily suitable for other analysis and processing built around a different requirement. Other datasets are semi-structured or even unstructured, such as web data and social network datasets. The key is to design a storage and access system that works in the same way for all types of dataset, irrespective of their source and format.

2.3 Semantic Description Can Be Easily Added in the Dataset

In any knowledge management system, it is very difficult to define a proper schema for a dataset and never change it. Different applications may focus on different schemas for the same dataset, and we may need to add more semantic meanings to the dataset during analysis. For example, a customer referral system may regard the address as a single string, whereas a logistics application may need to divide the address into country, state, street and so on in order to quantify the data. Within the storage system, one dataset may therefore have several schemas, depending on the application.

2.4 Semantic Meanings of Knowledge Can Be Easily Added in the Graph Based Knowledge Management System

In a knowledge management system, we will often have different definitions for the knowledge.
Knowledge is usually defined by certain properties, and there may also be linkages between different parts of the knowledge dataset. To describe the different semantic meanings of knowledge, we regard a particular category of knowledge as a knowledge node type and an atomic piece of knowledge as an instance of that node type. For example, technical publications in a particular field can be regarded as a knowledge node type; a single publication in this field is an instance of the node type. An edge is defined as a linkage between knowledge nodes, for example, papers written by the same author. Thus we can build a heterogeneous graph for knowledge management problems. We introduce the idea of a knowledge schema to describe the heterogeneous graph in the knowledge management system.

2.5 The Storage System Can Be Scalable and Have Good Performance in Knowledge Processing

With an increasing number of datasets within a knowledge management system, we have to think about scalability, performance, and the processing of the data.


Distributed storage and processing is a potential solution to this problem. To get better performance, we need to add indexes to the distributed storage and processing system. The index needs to scale easily. We can add and update an index based on the graph schema, the dataset schema, and the dataset. The index also needs to be processed in parallel for better performance. We introduce the ideas of indexing and delayed loading of a dataset into the knowledge management system.

3 Design Overview

The distributed storage system we propose is based on the Google distributed file system GFS [6]. It contains four parts: data nodes, which contain the dataset; data schema nodes and graph knowledge schema nodes, which contain the dataset schema and graph schema; index nodes, which contain the index for the dataset; and graph knowledge nodes, which contain the knowledge graph. Figure 1 shows the architecture of the proposed multi-indexed knowledge management storage system.

[Figure 1: architecture diagram showing the client, the NameNode with its metadata (name, replicas, ...), graph nodes, schema nodes, index nodes, and replicated data nodes across Rack 1 and Rack 2]
Fig. 1. The Logical Architecture of Multi-Indexed Storage System in Knowledge Management

The dataset is stored as raw data in the distributed file system based on Hadoop. Hadoop [7] is an open source software framework that supports data-intensive distributed applications, is designed for big data processing, and supports the running of applications on large clusters of commodity hardware. We add index features to the Hadoop system to build the knowledge management system. We regard every dataset, whether structured or unstructured, as files, which are stored in the Hadoop system as raw files. During processing, we also try to delay the actual loading of the dataset for better performance.

The data schema is the description of the dataset together with the metadata of the data index. One dataset may have more than one data schema. Each schema contains the self-described dataset structure and the metadata of the index. Since each row of data in the dataset has its own semantic meaning, the self-described dataset structure presents the position-sensitive or separator-sensitive meaning of the row data. The metadata of the index includes the availability of the index for each semantic section of the dataset, and the name, replicas and position of the index block. With the metadata of the index, we can easily find where the index block is. The data schema can be added when we import or first load the dataset, but we can also add more schemas later.

The data index gives the metadata of the data blocks. With the index system, we can locate the exact position of the data blocks, so we only need to start the minimal number of jobs to read the dataset, which reduces the total number of running jobs in the distributed storage and processing system. The data index can be built and updated dynamically. Based on the availability of resources and the statistics of the queries, we can choose how many indexes we need and update the index during the off-peak time of the system.

The knowledge graph schema is very similar to the data schema. It is the description of the heterogeneous knowledge graph together with the metadata of the knowledge index. The information of the knowledge graph has three parts: node type, node and edge.
Each category of knowledge is described as a node type; an instance of knowledge is regarded as a knowledge node; and an edge is the linkage between two knowledge instances. The metadata of the knowledge index indicates whether the knowledge is indexed or not.

The knowledge graph node stores the knowledge graph. We regard knowledge as a heterogeneous graph; the knowledge graph node contains the atomic knowledge items and the linkages between them. To avoid duplicating the dataset, a knowledge node contains only one or several pointers, which point into the dataset.

4 Details of Design

4.1 Name Node

The name node in our multi-indexed distributed knowledge management storage system is nearly the same as the GFS name node. In GFS, the name node contains the metadata of the data nodes, but in our system the index node contains the block metadata and a hash table that maps logical block metadata to physical block metadata. This means that in the event of hardware failure we do not need to update the index node.
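One way to read the hash table described above is that index entries refer only to logical block identifiers, while a separate table resolves each logical block to its current physical replicas. The following is a minimal sketch under that assumption; `BlockTable`, the block ids and the node names are hypothetical and are not part of Hadoop or of the system described in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class BlockTable:
    """Hypothetical logical-to-physical block map (not an HDFS API)."""
    physical: dict[str, list[str]] = field(default_factory=dict)

    def place(self, logical_id: str, replicas: list[str]) -> None:
        # Record which data nodes currently hold this logical block.
        self.physical[logical_id] = list(replicas)

    def fail_node(self, node: str, spare: str) -> None:
        # Re-home every replica that lived on the failed node.
        for logical_id, replicas in self.physical.items():
            self.physical[logical_id] = [spare if r == node else r for r in replicas]

    def resolve(self, logical_id: str) -> list[str]:
        return self.physical[logical_id]


# Index entries store logical block ids only (assumed layout).
index = {"city=Shanghai": ["blk-1", "blk-7"]}

table = BlockTable()
table.place("blk-1", ["dn-a", "dn-b"])
table.place("blk-7", ["dn-b", "dn-c"])

before = dict(index)
table.fail_node("dn-b", spare="dn-d")
locations = [table.resolve(block) for block in index["city=Shanghai"]]
```

In this sketch the failure of `dn-b` changes only the block table; the `index` dictionary is untouched, which is the property the paragraph above relies on.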

4.2 Data Schema Node and Data Index Node

The data schema contains the semantic description of the dataset. One dataset can have more than one schema. The schema shows which properties are in the dataset, the metadata of the properties, the availability of an index for each property, and the metadata of the index for each property. A typical data schema file contains six parts: dataset name, dataset description, property names, property description, property separation metadata and property index metadata. The property separation metadata supports separators, aliases, prefixes and postfixes, and combinations and iterations of these. The property index metadata points to the data index block if an index is available; if no index is available, it points to the dataset block.

The data index contains the index information for each property; the index is a full-text index on the property. The value stored for each key in the full-text index is the list of data blocks that contain the key. The data index is stored as a B+ tree for better search performance.

4.3 Knowledge Graph Schema Node

Nowadays, knowledge is interconnected and interacting, forming numerous large, interconnected and sophisticated networks. We regard it as a heterogeneous graph. In our knowledge storage system, we also add the graph schema and graph index to the system. The knowledge graph schema contains the semantic description of the knowledge. It includes the data schema used to support the knowledge, the properties of the knowledge, the linkages between pieces of knowledge, and the linkages between knowledge properties and dataset properties. A typical knowledge schema file contains eight parts: knowledge names, knowledge description, property names, property description, property metadata, linkage name, linkage description and linkage metadata.
The property metadata contains the linkage between the knowledge property and the data schema property. The linkage metadata specifies under which condition two nodes have an edge between them, and the direction of that edge.

4.4 Knowledge Graph Node

The knowledge graph node contains information about the knowledge schema, the knowledge itself and the linkage between them. To avoid duplicating the dataset, we store only the knowledge id, the node/edge type and a series of pointers, which point to the related rows in each related dataset. There is no data duplication in the knowledge graph system. If a query is by graph id, we can go to the dataset directly; if not, we translate the query into a series of data schema queries based on the knowledge graph schema. Since most knowledge graphs are very sparse, we use an adjacency list to store the knowledge graph.
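The storage layout described above can be sketched as an adjacency list whose nodes hold only an id, a type and pointers into the stored datasets. This is a minimal illustration, not the authors' implementation; `RowPointer`, `KnowledgeGraph`, the dataset name and the block ids are all assumptions made for the example, which reuses the "papers written by the same author" edge from Section 2.4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowPointer:
    """Hypothetical pointer to one row of a stored dataset."""
    dataset: str
    block: str
    row: int


class KnowledgeGraph:
    """Sparse heterogeneous graph kept as an adjacency list."""

    def __init__(self) -> None:
        # node_id -> (node type, pointers into the datasets); no row data is copied
        self.nodes: dict[str, tuple[str, list[RowPointer]]] = {}
        # node_id -> list of (edge type, target node id)
        self.adjacency: dict[str, list[tuple[str, str]]] = {}

    def add_node(self, node_id, node_type, pointers):
        self.nodes[node_id] = (node_type, list(pointers))
        self.adjacency.setdefault(node_id, [])

    def add_edge(self, source, edge_type, target, directed=False):
        self.adjacency[source].append((edge_type, target))
        if not directed:
            self.adjacency[target].append((edge_type, source))

    def neighbours(self, node_id, edge_type):
        return [t for e, t in self.adjacency[node_id] if e == edge_type]


graph = KnowledgeGraph()
graph.add_node("pub-1", "Publication", [RowPointer("papers", "blk-3", 12)])
graph.add_node("pub-2", "Publication", [RowPointer("papers", "blk-9", 4)])
graph.add_edge("pub-1", "same_author", "pub-2")
```

A query by node id can follow the stored `RowPointer` values straight to the dataset blocks, while the adjacency list keeps memory proportional to the number of edges rather than to the square of the number of nodes.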


5 Query Process Analysis

Figure 2 shows a query process in the multi-indexed storage system in knowledge management.

[Figure 2: query sequence diagram. Labelled steps: 1 open, 2 get schema locations, 3 open schema, 4 get block locations, 5 read, 6 read, 7 read, 8 close. Components: HDFS client (client JVM on the client node), distributed file system, SSData input stream, FSData input stream, name node, graph node (knowledge graph), schema nodes (knowledge graph schemas and data schemas), index node (data index), and data nodes (datasets)]
Fig. 2. Query Process in Multi-Indexed Storage System in Knowledge Management

The HDFS (Hadoop Distributed File System) client first sends an open request to the distributed file system (step 1 in fig. 2), and the request then goes to the name node (step 2 in fig. 2). Depending on whether the request opens a dataset or a graph, the name node returns the proper schema node based on the file name or graph name. The HDFS client then sends the query to the schema stream (SSData) (step 3 in fig. 2). The SSData queries the schema node for the block location of the dataset (step 4 in fig. 2). If the query is for a graph, the graph schema node follows the graph schema description, finds the data schema node, and then goes to the data index to find the exact data block information. Alternatively, if the query is by graph node id, it finds the block information directly. The data block information is then returned to the HDFS client. After the HDFS client gets the data block information, it sends the query to the file stream (FSData) (step 5 in fig. 2). The FSData then goes to the exact data blocks to read all of the data it needs (steps 6 and 7 in fig. 2).
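The two resolution paths for a graph query described above (directly by node id, or through the graph schema to the data schema and data index) can be sketched as follows. The dictionaries stand in for the graph, schema and index nodes; their layout, the node type, the property names and the block ids are hypothetical and chosen only for illustration.

```python
# Hypothetical in-memory stand-ins for the index, schema and graph nodes.
DATA_INDEX = {
    # (dataset, property) -> key -> blocks that contain the key
    ("transactions", "account"): {"A17": ["blk-2", "blk-5"]},
}
GRAPH_SCHEMA = {
    # node type -> knowledge property -> (dataset, data schema property)
    "CashTransaction": {"account": ("transactions", "account")},
}
GRAPH_NODES = {
    # node id -> blocks referenced by the node's pointers
    "n42": ["blk-5"],
}


def blocks_for_data_query(dataset, prop, value):
    """Look the key up in the data index and return its block list."""
    return DATA_INDEX.get((dataset, prop), {}).get(value, [])


def blocks_for_graph_query(node_type=None, prop=None, value=None, node_id=None):
    """Resolve a graph query to data blocks before any data is read."""
    if node_id is not None:
        # Query by graph node id: block information is found directly.
        return GRAPH_NODES.get(node_id, [])
    # Otherwise translate the knowledge property into a data schema query.
    dataset, data_prop = GRAPH_SCHEMA[node_type][prop]
    return blocks_for_data_query(dataset, data_prop, value)


by_id = blocks_for_graph_query(node_id="n42")
by_property = blocks_for_graph_query("CashTransaction", "account", "A17")
```

Only the returned block list would then be handed to the file stream for reading, which corresponds to steps 5 to 7 in the description above.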
6 Performance Optimization

Distribution issues: The whole multi-indexed storage system is based on the Hadoop distributed file system. We add two new types of node, the schema node and the index node. They behave very similarly to the data node; they all need to report a heartbeat to the name node. We keep the master/slave architecture for the whole system. Unlike some Lucene-based index systems, we separate the index and schema from the original dataset for better scalability and control. Thus we can simply limit the volume of the index. Based on the statistical information of the queries, we can also dynamically change how much index we need, and we can separate the update of the index from the update of the dataset.

Performance of execution: The purpose of adding an index to the system is to reduce the total number of jobs running on the knowledge management system. We try to locate the exact block before the map-reduce job is started. We also try to delay the real loading of the dataset. A lot of processing, such as set join, merge and difference, can be done at the index level [8]. Compared to the non-indexed Hadoop system, we add the work of building the index and of finding the block information in the schema-index system. As described before, building the index is now a separate job and can be done during the off-peak time of the system. Looking up the index adds loading and processing of schema-node and index data before the map-reduce job is started, which adds traffic on the internal network of the Hadoop system. This is acceptable since the size of the load is small. If the lookup hits the index, we can eliminate many map jobs.

Performance of building the index: We add a default data schema to every dataset, which is marked as having no attributes and has all the blocks of the dataset as its index value. If we query an existing data schema that has no index for a particular attribute, we also return all the blocks of the dataset as the value. In this way, the index system works even if there is no index. We can therefore delay the building of the index and, based on the statistical information, build the index dynamically.
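The fallback behaviour and the index-level set operations described above can be illustrated with a short sketch: an attribute without an index yields every block of the dataset, and the candidate block sets of several predicates are intersected before any job is started. All names and values below are hypothetical; this is an illustration of the idea, not the authors' code.

```python
# Hypothetical catalog: every block of each dataset, and the indexes built so far.
ALL_BLOCKS = {"transactions": ["blk-1", "blk-2", "blk-3", "blk-4"]}
INDEX = {
    # Only the "account" attribute has been indexed in this example.
    ("transactions", "account"): {"A17": ["blk-2", "blk-4"]},
}


def candidate_blocks(dataset, prop, value):
    """Blocks that may contain rows where prop == value."""
    prop_index = INDEX.get((dataset, prop))
    if prop_index is None:
        # No index for this attribute: fall back to all blocks of the dataset.
        return set(ALL_BLOCKS[dataset])
    return set(prop_index.get(value, []))


def blocks_for_conjunction(dataset, predicates):
    """Intersect candidate sets at the index level, before any job starts."""
    result = set(ALL_BLOCKS[dataset])
    for prop, value in predicates.items():
        result &= candidate_blocks(dataset, prop, value)
    return sorted(result)


indexed = blocks_for_conjunction("transactions", {"account": "A17", "ip": "10.0.0.1"})
unindexed = blocks_for_conjunction("transactions", {"ip": "10.0.0.1"})
```

Here the unindexed `ip` predicate does not narrow the search, but it does not break the lookup either, so an index for it can be built later without changing the query path.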
7 Experimental Evaluation

This new application is based on previous work on a transaction fraud detection system. In this application, data is obtained from an internal transaction record system; data is also obtained from social networks and third-party organizations that are related to the account holder. We then build a heterogeneous knowledge management system in order to analyze the historical data and find potential patterns of fraudulent transactions.

In the prototype of this system, we analyzed over 50,000 accounts related to over 2 million transactions, combining the data with social network records, IP addresses and addresses. The heterogeneous knowledge analysis was then undertaken for three different node types: Create Account, Cash Transaction and Credit Transaction. We add an index to the transaction dataset and the social network dataset. The x-axis of figure 3 and figure 4 is the number of parallel queries; the y-axis is the execution time in seconds. In figure 3, only 10% of the attributes are indexed; in figure 4, 80% are indexed.

Fig. 3. The end-end execution time for parallelized query (10% indexed)

Fig. 4. The end-end execution time for parallelized query (80% indexed)

From the end-to-end execution time for parallelized queries, we can see that in figure 3, because only 10% of the attributes are indexed, the performance of the indexed system is not very different from that of the original Hadoop system. In figure 4, however, performance improves substantially when the overall system load is not heavy; as the system load grows heavier, the performance gain becomes less significant.

8 Related Work

The integration of full-text indexing within a relational database is not a new idea; Oracle, IBM and Microsoft have done a lot of work along these lines. Jimmy Lin et al. [2] from Twitter present a full-text index to optimize selection operations on text fields within records. M. Cafarella and C. Ré also propose a solution that optimizes selection operations in Hadoop programs [9]. Hadoop++ [10] injects trojan indexes into Hadoop input splits at data loading time. Haojun Liao et al. [1] also give an R+ based full-text index solution on the data node.

Some research attempts to bridge relational databases and Map-Reduce programming models. Examples include Map-Reduce-Merge [3], an extension of the original Map-Reduce model that better supports relational operations; HadoopDB [4], an architectural hybrid that integrates Hadoop with PostgreSQL; and Dremel [5], which takes advantage of columnar compression for large-scale data analysis.

Other research adds indexes to graph systems. Interval labeling [11] and 2HOP labeling [12] are the typical solutions in this field. The interval labeling approaches use min-post-labeling or pre-post-labeling on a spanning subtree of the DAG. In the 2HOP index, each node determines a set of intermediate nodes it can reach, and a set of intermediate nodes that can reach it.

9 Conclusion and Future Work

In the multi-indexed knowledge management system we propose, we add semantic meaning to the dataset, build a heterogeneous knowledge graph, and add indexes to optimize query performance.
In future research, we will focus on performance tuning, such as how to improve the performance of index building and delayed data loading in the storage system; we will also focus on how to use the indexed storage system to optimize the knowledge management and sharing process.

Acknowledgement. This research is supported by NSFC No , the Science and Technology Commission of Shanghai Municipality funding for the Research on Cloud based Data Analysis and Processing in the internet of things (No ). We would like to express our sincere thanks to them.

References

1. Liao, H., Han, J., Fang, J.: Multi-dimensional Index on Hadoop Distributed File System. In: IEEE Fifth International Conference on Networking, Architecture and Storage (NAS) (2010)
2. Lin, J., Ryaboy, D., Weil, K.: Full-text indexing for optimizing selection operations in large-scale data analytics. In: Proceedings of the Second International Workshop on MapReduce and Its Applications (2011)
3. Yang, H., Dasdan, A., Hsiao, R.-L., Parker, D.S.: Map-Reduce-Merge: Simplified relational data processing on large clusters. In: SIGMOD (2007)
4. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In: VLDB (2009)
5. Melnik, S., Gubarev, A., Long, J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T.: Dremel: Interactive analysis of web-scale datasets. In: VLDB (2010)
6. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.: Bigtable: A distributed storage system for structured data. In: OSDI (2006)
7. Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: OSDI (2004)
8. Abadi, D.J.: Materialization strategies in a Column oriented DBMS, ICDE, Istanbul, Turkey, pp (2007)
9. Cafarella, M., Ré, C.: Manimal: Relational optimization for data-intensive programs. In: WebDB (2010)
10. Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). In: VLDB (2010)
11. Agrawal, R., Borgida, A., Jagadish, H.V.: Efficient management of transitive relationships in large data and knowledge bases. SIGMOD Rec. 18(2), (1989)
12. Cheng, J., Yu, J.X., Lin, X., Wang, H., Yu, P.S.: Fast computing reachability labelings for large graphs with high compression rate. In: EDBT (2008)
