Implementation of Relational Operations in Omega Parallel Database System *


Tatyana Y. Lymar
Computer Science Department
Chelyabinsk State University
Chelyabinsk, Russia
lymar@csu.ru

* This work was supported by the Russian Foundation for Basic Research under Grants and

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the CSIT copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Institute for Contemporary Education JMSUICE. To copy otherwise, or to republish, requires a fee and/or special permission from the JMSUICE.

Proceedings of the 3rd International Workshop on Computer Science and Information Technologies CSIT 2001, Ufa, Yangantau, Russia, 2001

Abstract

The paper describes the implementation of relational operations in the prototype of the Omega parallel database system for the Russian multiprocessor MVS-100/1000. The approach is based on an original mechanism for parallelizing query execution called the stream model. The stream model uses a special Omega exchange operator to arrange the parallelization of the query engine. The structure of the Omega exchange operator is presented. The implementation of several physical algebra operations is described.

1. Introduction

One of the important issues in the implementation of parallel Data Base Management Systems (DBMS) is the parallelization of query execution. This paper describes the organization of the parallel query executor in the prototype of the parallel Omega DBMS [1] for the Russian-made multiprocessor MVS-100 [2]. The Omega system has a three-level hierarchical hardware architecture. This hardware architecture is characterized by reliability and high data availability in cases of hardware failure, while providing high performance [3].
There are two well-known models used for implementing parallel query executors: the bracket model and the operator model [4]. The bracket model was used in a number of parallel DBMSs such as Gamma [5] and Bubba [6]. The operator model was used in the implementation of the parallel DBMS Volcano [7]. In the Omega system we use a novel mechanism for parallelizing query execution, which we call the stream model. This model combines the advantages of both the bracket and the operator models and is optimized for the hardware peculiarities of the Omega system. It utilizes the producer/consumer paradigm and a data-driven data flow mechanism for efficient data exchange between operators.

Each operation of the query tree is represented as a single lightweight process (a thread). In the Omega system each process is taken as a root thread (only one process can run on each processor module). Any thread may create any number of daughter threads. Thus, the threads form a hierarchy, which is supported by the thread manager. Control is passed among the threads according to a dynamic priority value, which is calculated with the help of the factor function of a thread. A more detailed description of the thread manager and the formulas for calculating dynamic priority may be found in [8].

In order to implement intraoperation parallelism, the stream model utilizes a special exchange operator. It encapsulates all the parallelism of the query executor and allows us to use sequential algorithms for implementing the main relational operations in the Omega system. The algorithms implemented to date are described in the last section of this paper.

2. Query executor of the Omega system

The query executor of the Omega system is a virtual machine capable of executing physical queries, that is, queries expressed in terms of physical algebra. On the level of physical algebra, any kind of parallelism in query execution is implemented explicitly. In particular, the arguments and results of physical algebra operations are fragments of relations. Parallel operations based on relation partitioning are implemented on the higher levels of the system hierarchy. The hierarchy of modules and classes of the query executor is shown in Figure 1.

Figure 1. Structure of the query executor (components: Tree Class, Executor, Stream Class, Stock Class, Operations interpreter)

The stock class implements the abstract Stock type. This type is necessary for implementing the producer/consumer model. The stream class implements the abstract Data Stream type, which incorporates all available types of data sources. The implementation of this class is described in the next section. The physical algebra operations interpreter module describes the structure and types of the parameters of relational operations and provides the procedures that implement the operations. The work of this module is described in detail in section 4. The tree class implements the abstract Query tree type and provides the methods for creating and traversing query trees. The executor module implements the bracket model and provides methods for creating, executing and eliminating physical query trees.

3. The stream model

The proposed stream model is based on the bracket template concept [4]. In the Omega system, query tree nodes are implemented by using the common bracket template, which can consume and produce data granules and can execute exactly one operator. Each query tree node is built on the bracket template by assigning the following main attributes: the pointer to the thread body function, the pointer to the thread factor-function, the pointer to the output stream (to be described below), the pointer to the left son, the pointer to the right son (if one exists) and the type of the node (disjunctive or conjunctive). In this context, the factor-function calculates the number of granules that reside in the output stream at the current point of time. The number of sons is limited to two since there are only unary and binary operators in the Omega system [8]. While the nodes of the query tree represent threads executing the corresponding operations, the edges represent the streams [9]. A stream is a generalization of the file concept; it operates as a virtual FIFO file.

A stream is a realization of a fragment of a relation on the level of physical algebra. The streams are the arguments and results of physical algebra operations, which means they are the input and output data of query tree nodes. The query executor of the Omega system supports streams of the following predefined types: a stored file (a fragment of a relation), a temporary file, a conductor channel, a router channel, and a stock [9]. The choice of the particular mechanism of data transfer depends on the physical location of the adjacent nodes:

1. If the nodes are placed on the same processor, they are realized as threads and are executed simultaneously. Data may be exchanged between threads with the help of two types of streams: a temporary file and a stock;
2. If the nodes are situated on different processors of the same cluster, data is transferred through the conductor channel;
3. If the nodes are executed in different clusters, data is transferred through the router channel.

Besides, the input data of the query tree is stored on the disks, and the query executor has to obtain it from the file system by means of a stream of file type. Most authors regard the above-mentioned situations as separate problems, concentrating mostly on inter-process exchanges [4]. We have decided to integrate these cases into a single concept. In order to unify the data transfer interface, we introduce the concept of a stream incorporating all of the above situations: the streams realize a universal mechanism of data exchange between processes and disks, between a process and a conductor or a router channel, and between two processes.

The described concept of data exchange is realized in the query executor by the stream class. Each member of this class has the following attributes:

    Stream type;
    A virtual file identifier;
    Element size in bytes;
    A pointer to the working buffer;
    Pointers to functions of stream management.

Information about the existing streams is stored in a static array. The ordinal number of an element of the array is the identifier of the corresponding stream. The stream class provides the basic methods for creating abstract objects of the stream type and the service functions that realize the concrete stream types and access to them. The basic functions include the functions for creating and deleting streams and the function for accessing the attributes of a stream. The basic functions for creating and deleting a member of the stream class execute the corresponding actions in the stream descriptor array. The service functions include the following actions:

    Open the stream;
    Close the stream;
    Reset the stream;
    Put a data granule into the stream;
    Get a data granule from the stream;
    Check completion of exchange in the stream;
    Return the number of data granules in the stream.

Read and write operations for all types of streams except the stock type are asynchronous, which makes it necessary to check the completion of data exchange in a stream. The interface of the service functions of the stream class is unified as far as possible. This allows a stream to be addressed regardless of the concrete mechanism of its work, which significantly facilitates further development of the system as a whole and of the query executor in particular. Let us consider the realization of each stream type in greater detail.

Streams of file type

The representation of a stream of file type is an open stored file. This means that before a stream is created and used, the corresponding file has to be opened in the necessary mode (read-only if the stream admits no writes, or read-write if writes are allowed). The operations of opening, closing and resetting the file stream are independent of the file's state. The corresponding action is performed on the file iterator: when a stream is opened the iterator is created, when a stream is closed the iterator is eliminated, and when a stream is reset the iterator is set before the first record in the file. Reading a record from the file is realized by actions on the iterator: the pointer to the current record is moved to the next position and returned as the result of the function. Thus, the data of file streams can be used repeatedly.

Streams of temporary file type

The representation of a stream of temporary file type is a temporary file. It is created and deleted together with the corresponding stream. When the stream is reset, the corresponding file iterator is set to its initial state. The records of the temporary file are not deleted on reading or resetting; thus streams of temporary file type allow the data to be used repeatedly.

Streams of conductor channel type

The representation of a stream of conductor channel type is a conductor channel. When a stream is created and opened, no channel is created. A channel is created only when a write or read operation is executed, and it is eliminated immediately after the operation is completed. The reset operation for the conductor channel stream is void because streams of this type do not allow the data to be used repeatedly.

Streams of router channel type

The representation of a stream of router channel type is a router channel. Its mechanism is analogous to that of the conductor channel.

Streams of stock type

A stock is created and deleted together with the corresponding stream. When the Reset operation is executed, the stock is emptied. When an element is read from a stream, this element is deleted from the stock. Thus, the data of stock type streams cannot be used repeatedly. In the Omega system, stocks are the basic type of streams used to represent query trees [9]. A stock is a FIFO buffer situated in main memory.
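The unified service interface of the stream class described above can be pictured as a descriptor that carries pointers to the management functions of a concrete stream type. The following is only a minimal sketch under that assumption: the names stream_ops, mem_stream_init, st_put, st_get, st_reset and st_count are hypothetical and are not taken from the Omega sources, and an array-backed stream stands in for a restartable stream type (such as a temporary file) to show that reading and resetting act on an iterator and leave the data in place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct stream stream;

/* Hypothetical table of stream management functions, one per stream type. */
typedef struct {
    void        (*reset)(stream *s);               /* rewind the iterator      */
    int         (*put)(stream *s, const char *g);  /* 0 on success, -1 if full */
    const char *(*get)(stream *s);                 /* NULL at end of data      */
    size_t      (*count)(const stream *s);         /* granules not yet read    */
} stream_ops;

struct stream {
    int               type;       /* stream type                        */
    size_t            elem_size;  /* element size in bytes              */
    char             *buf;        /* working buffer                     */
    size_t            capacity;   /* capacity of buf, in elements       */
    size_t            used;       /* elements written so far            */
    size_t            pos;        /* iterator position for reading      */
    const stream_ops *ops;        /* pointers to management functions   */
};

enum { STREAM_MEMORY = 1 };  /* assumed type tag for the in-memory stand-in */

static void mem_reset(stream *s) { s->pos = 0; }

static int mem_put(stream *s, const char *g)
{
    if (s->used >= s->capacity)
        return -1;
    memcpy(s->buf + s->used * s->elem_size, g, s->elem_size);
    s->used++;
    return 0;
}

static const char *mem_get(stream *s)
{
    if (s->pos >= s->used)
        return NULL;
    return s->buf + s->elem_size * s->pos++;
}

static size_t mem_count(const stream *s) { return s->used - s->pos; }

static const stream_ops mem_ops = { mem_reset, mem_put, mem_get, mem_count };

void mem_stream_init(stream *s, char *storage, size_t elem_size, size_t capacity)
{
    assert(s != NULL && storage != NULL && elem_size > 0);
    s->type = STREAM_MEMORY;
    s->elem_size = elem_size;
    s->buf = storage;
    s->capacity = capacity;
    s->used = 0;
    s->pos = 0;
    s->ops = &mem_ops;
}

/* Type-independent service functions: callers never see the concrete type. */
void        st_reset(stream *s)              { s->ops->reset(s); }
int         st_put(stream *s, const char *g) { return s->ops->put(s, g); }
const char *st_get(stream *s)                { return s->ops->get(s); }
size_t      st_count(const stream *s)        { return s->ops->count(s); }
```

A caller would initialize a stream over a caller-owned buffer with mem_stream_init and then use only the st_ functions. Supporting another stream type in this sketch would mean supplying another stream_ops table, leaving the callers unchanged.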
When the length of a stock equals one, the stock model becomes equivalent to the classical iterator model used for managing data flow in most DBMSs [4]. The test results [10] show that using stocks longer than one element can provide better performance in the case of data skew.

The concept of a stock is realized in the query executor by the stock class. This class represents an abstract type of output thread buffers in the producer/consumer model. The elements of the stock class are byte strings of fixed length, structured as a queue. Elements may be placed into and taken from the stock. Information about the stocks is contained in descriptors stored in a static array. A descriptor includes the following fields: the maximum number of elements in the stock, the length of an element, the address of the memory block allocated for the stock, and pointers to the current top and bottom of the stock. Each existing stock is associated with a value called the stock-filling factor. It is calculated dynamically and taken into account when scheduling the threads.
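The stock descriptor described above maps naturally onto a fixed-size ring buffer. The sketch below is an illustration only: the type and function names are hypothetical, the count field is added for convenience, and the filling factor is assumed here to be the fraction of occupied slots, since the text does not give its formula.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stock descriptor mirroring the fields listed in the text. */
typedef struct {
    size_t max_elems;  /* maximum number of elements in the stock   */
    size_t elem_len;   /* length of one element in bytes            */
    char  *mem;        /* memory block allocated for the stock      */
    size_t top;        /* slot where the next element is placed     */
    size_t bottom;     /* slot holding the oldest element           */
    size_t count;      /* elements currently stored (added field)   */
} stock;

void stock_init(stock *s, char *mem, size_t max_elems, size_t elem_len)
{
    assert(s != NULL && mem != NULL && max_elems > 0 && elem_len > 0);
    s->max_elems = max_elems;
    s->elem_len = elem_len;
    s->mem = mem;
    s->top = s->bottom = s->count = 0;
}

/* Reset empties the stock, as described for streams of stock type. */
void stock_reset(stock *s)
{
    s->top = s->bottom = s->count = 0;
}

/* Producer side: returns 0 on success, -1 if the stock is full. */
int stock_put(stock *s, const char *elem)
{
    if (s->count == s->max_elems)
        return -1;
    memcpy(s->mem + s->top * s->elem_len, elem, s->elem_len);
    s->top = (s->top + 1) % s->max_elems;
    s->count++;
    return 0;
}

/* Consumer side: copies the oldest element to out and removes it.
   Returns 0 on success, -1 if the stock is empty. */
int stock_get(stock *s, char *out)
{
    if (s->count == 0)
        return -1;
    memcpy(out, s->mem + s->bottom * s->elem_len, s->elem_len);
    s->bottom = (s->bottom + 1) % s->max_elems;
    s->count--;
    return 0;
}

/* Assumed definition of the filling factor: occupied fraction in [0, 1]. */
double stock_fill_factor(const stock *s)
{
    return (double)s->count / (double)s->max_elems;
}
```

With max_elems set to one, every put must be followed by a get before the producer can continue, which corresponds to the iterator-like behaviour mentioned above. A longer stock lets the producer run ahead of the consumer.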

The Omega bracket template does not support intraoperator parallelism. In order to support intraoperator parallelism on partitioned datasets, we introduced a novel Omega exchange operator. The structure of the Omega exchange operator is shown in Figure 2. It consists of the following four suboperators: the split operator, the gather operator, the scatter operator and the merge operator. All these operators are implemented by using the generic bracket template.

Figure 2. Exchange operator structure (suboperators: Gather, Merge, Split, Scatter)

The split operator divides its input into two parts using the partitioning function. The first part includes just those data granules that have to be processed inside the local node. These granules are directed to the output stock of the split operator. The second part consists of those data granules that have to be processed on remote nodes. These granules are directed into the output stock of the scatter operator, which serves in this context as an input stream. In the context of the generic bracket template, the split operator is implemented as a binary conjunctive operator whose right input stream serves as the second output stock.

The scatter operator consumes data granules from its own output stream and sends them to the corresponding Omega-cluster nodes by using the partitioning function. In the context of the generic bracket template, the scatter operator is implemented as an operator without input streams whose output stream serves as its input stock.

The gather operator constantly reads data granules from all Omega-cluster nodes except its own node and puts the granules it has read into its output stock. In the context of the generic bracket template, the gather operator is implemented as an operator without input streams.

The merge operator receives data granules from its input streams and merges them into the output stream. In the context of the generic bracket template, the merge operator is implemented as a binary disjunctive operator.
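The division of work performed by the split suboperator can be illustrated with a small sketch. It assumes a modulo-based partitioning function over an integer key and represents the two output stocks as plain arrays. The names granule, partition_node and split_granules are hypothetical, and the actual partitioning function of the Omega system is not given in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical data granule: a key used for partitioning plus a payload. */
typedef struct {
    int key;
    int payload;
} granule;

/* Assumed partitioning function: maps a key to a node number in [0, nodes). */
int partition_node(int key, int nodes)
{
    int r;
    assert(nodes > 0);
    r = key % nodes;
    return r < 0 ? r + nodes : r;  /* keep the result non-negative */
}

/* Split step: granules that belong to node `self` go to `local` (the output
   stock of split); all others go to `remote` (the stock read by scatter).
   Both output arrays must have room for n granules.
   Returns the number of local granules; *n_remote receives the remote count. */
size_t split_granules(const granule *in, size_t n, int self, int nodes,
                      granule *local, granule *remote, size_t *n_remote)
{
    size_t nl = 0, nr = 0, i;
    for (i = 0; i < n; i++) {
        if (partition_node(in[i].key, nodes) == self)
            local[nl++] = in[i];
        else
            remote[nr++] = in[i];
    }
    *n_remote = nr;
    return nl;
}
```

In this sketch the scatter step would apply partition_node once more to each granule in remote to choose the destination node, while gather and merge would feed granules received from other nodes into the same output as the local ones.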
The Omega exchange operator is implemented by using the common generic template; therefore, it can be inserted at any one place or at multiple places in a complex query tree. The other operators are entirely unaffected by the presence of Omega exchange operators in the query plan. The Omega exchange operator does not contribute to data manipulation on the logical level. On the physical level, however, it provides a facility not provided by any of the normal operators, namely data redistribution. Thus, the Omega exchange operator encapsulates all parallelism issues and makes the implementation of parallel database algorithms significantly easier and more robust.

4. Realization of relational operations

As mentioned above, on the level of the query executor the parallelism of query execution is implemented explicitly. In particular, the relational operations are performed on fragments of relations and are not themselves parallel. The algorithms of some basic relational operations are given below.

Scan

This operation sequentially takes records from the input stream, checks the given condition, and puts the records that meet the condition into the output stream. Both the input and the output stream must be opened before the operation can be executed.

    /* Parameter types are assumed; the source lists only parameter names.
       The return values were lost in the source; 0 is a placeholder. */
    static int _scan(int input, int output, int (*predicate)(char *))
    {
        static char *buf;

        /* End of input: pass the end-of-data mark to the output. */
        if (stream_eod(input, buf = stream_read(input))) {
            stream_writeeod(output);
            return 0;
        }
        /* Skip the record if a predicate is given and it is not satisfied. */
        if (predicate != NULL)
            if (!((*predicate)(buf)))
                return 0;
        stream_write(output, buf);
        return 0;
    }

Figure 3. Algorithm of SCAN Operation

The selection condition is given by a separate parameter called the selection predicate, which is a pointer to a function. If this parameter is NULL, all the records from the input stream are placed into the output stream. Otherwise only those records are selected for which the given predicate is TRUE. The predicate receives a pointer to the current element of the stream as its parameter.

Split

This operation sequentially scans the elements of the input stream. For each element the predicates of all streams of the output fan are calculated. The element is copied to exactly those streams of the output fan whose predicates return TRUE. Here fanout is a pointer to the array of output streams, dimout is the number of outputs, and predicatev is the predicate vector.

    /* Parameter types are assumed; the return values were lost in the
       source, 0 is a placeholder. */
    static int _split(int input, int *fanout, int dimout,
                      int (**predicatev)(char *))
    {
        static char *buf;
        int i;

        /* End of input: pass the end-of-data mark to every output. */
        if (stream_eod(input, buf = stream_read(input))) {
            for (i = 0; i < dimout; i++)
                stream_writeeod(fanout[i]);
            return 0;
        }
        /* Copy the record to each output whose predicate holds. */
        if (predicatev != NULL)
            for (i = 0; i < dimout; i++)
                if ((predicatev[i])(buf))
                    stream_write(fanout[i], buf);
        return 0;
    }

Figure 4. Algorithm of SPLIT Operation

Merge

Merges two input streams into one output stream. The records of the two input streams take turns entering the output stream; when one of the streams is over, the remaining records of the other stream are passed to the output. No duplicates are removed in the process.

    /* Parameter types are assumed; the return values were lost in the
       source, 0 is a placeholder. */
    static int _merge(int inputa, int inputb, int output)
    {
        static char eoda = FALSE;
        static char eodb = FALSE;
        char *buf;

        if (!(eoda)) {
            if (stream_eod(inputa, buf = stream_read(inputa)))
                eoda = TRUE;
            else
                stream_write(output, buf);
        }
        if (!(eodb)) {
            if (stream_eod(inputb, buf = stream_read(inputb)))
                eodb = TRUE;
            else
                stream_write(output, buf);
        }
        /* Keep going until both inputs are exhausted. */
        if (!(eoda && eodb))
            return 0;
        stream_writeeod(output);
        return 0;
    }

Figure 5. Algorithm of MERGE Operation

Product

Calculates the direct product of the input streams. The result is written to the output stream. The stream scanned in the inner loop (to which the right sub-tree in the query tree corresponds) must allow the data to be used repeatedly.

    /* Parameter types are assumed; the return values were lost in the
       source, 0 is a placeholder. */
    static int _product(int inputa, int inputb, int output)
    {
        static char eodb = FALSE;
        char *bufa, *bufb, *bufc = stream_a(output)->buf;
        int i, widgetlena = stream_a(inputa)->widgetlen;

        if (stream_eod(inputa, bufa = stream_read(inputa))) {
            stream_writeeod(output);
            return 0;
        }
        /* The inner stream was read to the end on a previous call: rewind. */
        if (eodb)
            stream_reset(inputb);
        while (!stream_eod(inputb, bufb = stream_read(inputb))) {
            /* Concatenate the current outer and inner records. */
            for (i = 0; i < widgetlena; i++)
                bufc[i] = bufa[i];
            for (i = widgetlena; i < stream_a(output)->widgetlen; i++)
                bufc[i] = bufb[i - widgetlena];
            stream_write(output, bufc);
        }
        eodb = TRUE;
        return 0;
    }

Figure 6. Algorithm of PRODUCT Operation

5. Conclusion

This paper proposes a novel method of implementing relational operations. The method is based on a model of query parallelization called the stream model. The proposed model allows query execution to be parallelized automatically over any number of processors. This is achieved by means of a special exchange operator, which encapsulates all the mechanisms necessary for realizing intraoperator parallelism. This approach significantly facilitates the implementation of parallel algorithms in DBMSs and makes them more reliable. The algorithms of relational operations implemented to date in the Omega system are described. The distinguishing feature of the stream model is the automatic synchronization and scheduling of the processes that execute operations in the query tree. The stream model includes the bracket template and the class of data objects called streams. The described mechanisms provide better system performance in the presence of data skew. The described model was realized in the prototype of the parallel Omega system based on the MVS-100 in an 8-processor configuration. Tests have indicated high efficiency of the proposed approach.

References

1. Sokolinsky L., Axenov O., Gutova S. Omega: The Highly Parallel Database System Project. In the proceedings of the First East-European Symposium on Advances in Database and Information Systems (ADBIS'97), St.-Petersburg, 1997, vol. 2, pp.
2. Zabrodin A.V., Levin V.K., Korneev V.V. The Massively Parallel Computer System MBC-100. In the proceedings of PaCT-95 (Lecture Notes in Computer Science), 1995, vol. 964, pp.
3. Sokolinsky L.B. Operating System Support for a Parallel DBMS with an Hierarchical Shared-Nothing Architecture. In the proceedings of the Third East-European Symposium on Advances in Database and Information Systems (ADBIS'99), Maribor, Slovenia, 1999, pp.
4. Graefe G. Query evaluation techniques for large databases. ACM Computing Surveys, June 1993, vol. 25, No. 2, pp.
5. DeWitt D.J., et al. The Gamma database machine project. IEEE Transactions on Knowledge and Data Engineering, March 1990, vol. 2, No. 1, pp.
6. Boral H., et al. Prototyping Bubba, a Highly Parallel Database System. IEEE Transactions on Knowledge and Data Engineering, March 1990, vol. 2, No. 1, pp.
7. Graefe G. Encapsulation of Parallelism in the Volcano Query Processing System. In the proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, 1990, pp.
8. Sokolinsky L.B. Interprocessor Communication Support in the Omega Parallel Database System. In the proceedings of the First International Workshop on Computer Science and Information Technologies (CSIT'99), Moscow, 1999, vol.
9. Lymar T.Y., Sokolinsky L.B. Data Streams Organization in Query Executor for Parallel DBMS. In the proceedings of the 4th IEEE International Baltic Workshop, Lithuania, Vilnius, 2000, vol. 1, pp.
10. Sokolinsky L.B. Design and Evaluation of Database Multiprocessor Architecture with High Data Availability. In the proceedings of the 12th International DEXA Workshop, Munich, Germany,
