ANALYSIS OF THE FEATURES OF THE OPTIMAL LOGICAL STRUCTURE OF DISTRIBUTED DATABASES

E.V. Nurmatova 1,a, V.V. Gusev 2, V.V. Kotliar 2

1 «MIREA Russian Technological University», Moscow
2 NRC "Kurchatov Institute" IHEP

E-mail: a nurmatova@mirea.ru

The questions of constructing the optimal logical structure of a distributed database (DDB) are considered. Solving these issues will make it possible to increase the speed of processing requests in a DDB in comparison with a traditional database. An optimal logical structure of the DDB will ensure that the information system uses computational resources efficiently. The problem of constructing an optimal logical structure of a DDB is reduced to a problem of quadratic integer programming. As a result of its solution, the logical structure of the DDB is decomposed into a number of clusters that have minimal information connectivity with each other. In particular, such tasks arise in organizing systems for processing huge amounts of information from the Large Hadron Collider. In these systems various DDBs are used to store information about: 1) the system of triggers for data collection from the physical experimental installations (ATLAS, CMS, LHCb, ALICE), 2) the geometry and the operating conditions of the detector while collecting experimental data.

Keywords: optimal logical structure, query implementation scheme, distributed databases, grid architecture.

2018 Elena V. Nurmatova, Victor V. Gusev, Victor V. Kotliar

1. Introduction

An optimal logical structure of a DDB ensures that the information system uses computational resources efficiently. Such tasks arise in various areas of information and communication technologies where distributed systems, high-load systems, or high-reliability systems are already used. In particular, such tasks arise in organizing systems for processing huge amounts of information from the Large Hadron Collider (LHC). In these systems various DDBs are used to store information about: the system of triggers for collecting data from the physical experimental installations (ATLAS, CMS, LHCb, ALICE), the geometry of the detector while collecting experimental data, the metadata of the events on the detectors (for example, maintenance work on the installation), and the operating conditions of the detector while collecting experimental data [1]. The last DDB, the Conditions database, is the most complex in its organization. A special program interface, CORAL [2], was written for access to it. DDBs are also used directly in processing LHC data in the distributed Grid computing environment. To optimize the transfer of the results of individual computing tasks over computer networks, a multilevel hierarchical system is used. This system keeps track of where data is stored and where it needs to be transferred [3].

2. Synthesis of the optimal logical structure of a distributed database

DDB construction technologies can be used for storing data in monitoring systems of complex computing infrastructures, which combine engineering, network, and software environments [4]. The constant search for methods to reduce the costs of data storage and processing created an objective need to deepen research on these issues and determined the relevance of the topic of this work.
Synthesis of the optimal logical structure of a distributed database is the process of finding the optimal mapping of the canonical structure of the DDB to a logical one, which provides the optimal value of a given criterion for the performance of the information system and satisfies the basic system, network, and structural constraints [5]. When a canonical structure is mapped into a logical one, the data groups of the canonical structure of the DDB are combined into types of logical records and simultaneously distributed among the nodes of the computing system. The logical structure (LS) of a DDB is understood as an ordered set of logical records and connections (relations) between them, distributed over the nodes of the computing system. These logical records and connections reflect the semantic and functional properties and features of the given subject area of the information system.

Let the logical structure of the database be given by the graph $G(N, L)$. To implement the k-th query on the graph $G(N, L)$, it is necessary to select the search tree $G_k(N_k, L_k)$, where $N_k \subseteq N$ is a subset of vertices and $L_k \subseteq L$ is the subset of connections of the graph $G$ chosen by the search tree $G_k$. It should be noted that one of the vertices of $N_k$ is the entry point to the logical structure of the database.

Figure 1 shows an example of a search tree $G_k$ and the direction of its traversal. The response time to a query can be reduced by eliminating the vertices to which it is necessary to return in order to pass to the next branch. The graph $G_k^u$ thus obtained contains only those vertices (records) whose elements form the output structures of the query. Thus, the task of minimizing the response time to a query of the k-th type is reduced to the problem of choosing the optimal connections between the records of the graph $G_k^u$.

Figure 1. The scheme of the graph $G_k(N_k, L_k)$ and the direction of its traversal

Figure 2. The scheme of the complete graph $G_k^u$ and the connections between all vertices of the subset

The scheme of a complete graph (see Figure 2) can be used to pose the problem of choosing the optimal structure of connections between the records used to form the output structures of the k-th query, with restrictions on the possibility of using a certain access path.

The created DDBs can be large, and therefore they are loaded and put into operation in parts. To this end, the LS of the DDB should be divided into a number of substructures, or clusters, that have minimal mutual connectivity under the following restrictions: on the dimension of the clusters, on the types of storage devices used, on the degree of semantic proximity of the logical records included in the clusters, etc. The source for the formulation and solution of this problem is the information obtained as a result of the synthesis of the LS of the DDB (see Table 1).

Table 1. The notation for source variables

$N = \{n_j \mid j = 1, \dots, J\}$ : the set of record types of the LS of the DB
$W = \|w_{jj'}\|$ : the matrix of connections between the records of the structure
$F = \|f^{K}_{jj'}\|$ : the matrix of access paths for implementing user queries
$R = \{r_1, \dots, r_j, \dots, r_J\}$ : the vector of the set of logical records
a vector with one component per logical record, $j = 1, \dots, J$ : the record lengths in bytes

To formalize the task, we introduce the variables $x_{je}$: $x_{je} = 1$ if the j-th record is included in the e-th cluster, and $x_{je} = 0$ otherwise.
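As an illustration of the indicator variables just introduced, the following minimal Python sketch builds the 0/1 matrix from a list of cluster labels. The function name `indicator_matrix`, the five records, and the two clusters are assumptions made for this example and do not come from the paper.

```python
def indicator_matrix(labels, n_clusters):
    """Build the 0/1 matrix whose entry [j][e] is 1 iff record j is in cluster e."""
    matrix = [[0] * n_clusters for _ in labels]
    for j, e in enumerate(labels):
        matrix[j][e] = 1
    return matrix


# Hypothetical assignment of five logical records to two clusters.
labels = [0, 0, 1, 1, 0]
X = indicator_matrix(labels, n_clusters=2)

# Column sums give the number of records in each cluster.
cluster_sizes = [sum(row[e] for row in X) for e in range(2)]
# Row sums show in how many clusters each record appears.
memberships = [sum(row) for row in X]
```

Building the matrix from one label per record means every row contains a single 1, so each record belongs to exactly one cluster by construction.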

The task of decomposing the logical structure of the DDB into a set of clusters with minimal connectivity between them is formulated as follows:

$$\min_{\{x_{je}\}} \sum_{e=1}^{E} \sum_{\substack{e'=1 \\ e' \neq e}}^{E} \sum_{j=1}^{J} \sum_{j'=1}^{J} w_{jj'} \, x_{je} \, x_{j'e'},$$

with the restriction that each logical record is included in exactly one cluster,

$$\sum_{e=1}^{E} x_{je} = 1, \quad j = 1, \dots, J,$$

and with the restriction on the total number of logical records in a cluster,

$$\sum_{j=1}^{J} x_{je} \le M, \quad e = 1, \dots, E,$$

where $E$ is the number of clusters and $M$ is the allowable number of records in a cluster.

This task is a problem of quadratic integer programming. As a result of solving it, the LS of the DDB is decomposed into a number of clusters that have minimal information connectivity with each other. The database of a separate cluster can then be developed taking into account the importance of the logical records included in it in terms of user requirements, development complexity, and other factors.

The solution of such a task is of great practical importance for computer-aided design of the LS of a DDB and for generating specifications for queries and corrections of the DDB. This is especially important when organizing a DDB in the client-server architecture, in which the query language SQL is used as an interface. The query specifications for it include two main parts: query objects and search conditions. The query objects list the information elements and the logical records required by the user. It should be noted that constructing queries manually takes a long time, because the user needs detailed knowledge of the composition and structure of the logical records, the interconnections between them, the characteristics of their organization, etc.

3. The logic scheme of query implementation

Queries in the DDB are characterized by the composition of the requested data, the frequency of their use, and performance characteristics: the average numbers of record instances analyzed and selected during the search.
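To make the decomposition problem of Section 2 concrete, here is a minimal sketch that solves a tiny instance by exhaustive enumeration. It is not the authors' algorithm: brute force is only feasible for a handful of records, and the function names, the 4 x 4 connection matrix, and the size limit are all assumptions chosen for illustration.

```python
from itertools import product


def inter_cluster_weight(w, labels):
    """Sum w[j][k] over ordered pairs of records placed in different clusters."""
    n = len(labels)
    return sum(
        w[j][k]
        for j in range(n)
        for k in range(n)
        if labels[j] != labels[k]
    )


def best_partition(w, n_clusters, max_size):
    """Try every assignment of records to clusters and keep the cheapest feasible one."""
    n = len(w)
    best_labels, best_cost = None, None
    for labels in product(range(n_clusters), repeat=n):
        # Size restriction: no cluster may hold more than max_size records.
        if any(labels.count(e) > max_size for e in range(n_clusters)):
            continue
        cost = inter_cluster_weight(w, labels)
        if best_cost is None or cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost


# Hypothetical symmetric connection matrix for four logical records:
# records 0-1 and 2-3 are strongly linked, records 1-2 only weakly.
w = [
    [0, 5, 0, 0],
    [5, 0, 1, 0],
    [0, 1, 0, 4],
    [0, 0, 4, 0],
]
labels, cost = best_partition(w, n_clusters=2, max_size=2)
```

Each assignment gives one label per record, so the single-membership restriction holds automatically and only the size restriction has to be checked. For realistic numbers of records the enumeration grows exponentially, and an integer programming solver or a heuristic would be needed instead.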
The logic scheme of the implementation of a DDB user query includes the following sequence of operations:

1. Designing and initiating a request by the user in the node of the computing system to which the user is attached, in the query language of the selected DDB management system.
2. Transferring the information to the database server via communication channels for the implementation of the query.
3. Processing the query using the appropriate methods and means of the database and solving the following main tasks: selecting from the database the data required in the query; decomposing the query into subqueries (tasks), the number of which is determined by the number of required database servers; selecting optimal access paths to the required database servers; establishing logical connections with the nodes of the computing system on which the required database servers are located.
4. Transferring the subqueries to the required database servers via communication channels.
5. Servicing the subqueries on the database servers.
6. Transferring the blocks of selected data from the database servers via communication channels to the server of the node that initiated the request.
7. Assembling the blocks of data into the array required in the query.
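The sequence of operations above can be sketched as a small scatter-gather routine. This is an illustrative model only: the servers are in-memory dictionaries, the server names and record contents are invented, and the selection of optimal access paths and the network transfers are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placement of logical record types on two database servers.
SERVERS = {
    "server_a": {"trigger": [{"run": 1, "menu": "physics"}]},
    "server_b": {
        "geometry": [{"run": 1, "layout": "v2"}],
        "conditions": [{"run": 1, "hv": "nominal"}],
    },
}


def decompose(record_types):
    """Split a query into one subquery per server that holds requested records."""
    subqueries = {}
    for server, tables in SERVERS.items():
        wanted = [t for t in record_types if t in tables]
        if wanted:
            subqueries[server] = wanted
    return subqueries


def serve(server, record_types):
    """Service one subquery and return its block of selected data."""
    return {t: SERVERS[server][t] for t in record_types}


def run_query(record_types):
    """Decompose the query, service the subqueries in parallel, assemble the result."""
    subqueries = decompose(record_types)
    with ThreadPoolExecutor() as pool:
        blocks = list(pool.map(lambda item: serve(*item), subqueries.items()))
    result = {}
    for block in blocks:
        result.update(block)
    return result


result = run_query(["trigger", "conditions"])
```

The number of subqueries equals the number of servers that hold requested record types, which mirrors the decomposition step; in a real DDB each call to `serve` would be a remote request to a database server.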

Evaluation criteria, as a tool for designing a DDB, are necessary for choosing a rational database structure among several alternatives. Most of the problems and failures in developing a database model arise from a vague idea of what is meant by optimal database design. At present, and in the near future, uncertainty in the choice of these criteria will remain the weakest link in the development of a database model.

4. Time, cost and volume characteristics of DB functioning

Difficulties in determining the criteria for choosing alternative solutions are mainly caused by two factors [5]:

1. An almost infinite number of different database structures that satisfy the same set of system requirements can be built. Selection criteria should make it possible to differentiate all currently available alternatives.
2. Alternatives are difficult to evaluate, since the criteria differ in sensitivity and in the duration over which they act.

When solving the problem of the synthesis of the LS of a DDB, the following basic time, cost and volume characteristics of the functioning of the DDB are used.

The main time characteristics are the duration of the implementation of a given set of queries and the duration of the implementation of a given set of transactions. In sum, these two indicators give the total duration $T$ of the workload of the DDB:

$$T = \sum_{p} T_p + \sum_{s} T_s,$$

where $T_p$ is the duration of the implementation of the p-th user query and $T_s$ is the duration of the implementation of the s-th transaction.

The times $T_p$ and $T_s$ depend on many factors: the network and telecommunications equipment used, the technical parameters of the servers, the network and telecommunications system-wide software, the characteristics of the relational database management system (RDBMS), the bandwidth of the communication channels, the running times of programs at various levels of the network protocols, etc.

The main cost characteristics of a DDB are: the cost of storing information in the DDB, $E_{xp}$; the query and transaction costs over a given time interval; and the cost of transmitting information via communication channels. The sum of these three components determines the total cost $E$ of the functioning of the DDB.

The cost of storing information in the DDB is determined by the physical amount of information $V_{xp}$ and the cost of storing a unit of information (one logical record) on the server. If we accept that the unit storage cost is the same in all nodes of the computing system, then $E_{xp}$ is the product of $V_{xp}$ and the unit storage cost. In practice, the unit storage cost is approximately 1.2-1.5.

Finally, the volume characteristics of the functioning of the DDB and of the sets of user queries and transactions are calculated from the following assumption: the amount of memory occupied by one instance of a record equals the sum of the amount of useful information and the product of the length of one address reference and the number of pointers in the record.
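The time, cost and volume relations of this section reduce to simple arithmetic, sketched below. The function names and all numeric inputs are hypothetical values chosen for illustration; only the relations themselves follow the text.

```python
def total_workload_time(query_times, transaction_times):
    """Total workload duration: sum of query durations plus transaction durations."""
    return sum(query_times) + sum(transaction_times)


def storage_cost(volume, unit_cost):
    """Storage cost, assuming the same unit storage cost in every node."""
    return volume * unit_cost


def record_instance_size(useful_bytes, pointer_bytes, n_pointers):
    """Memory for one record instance: useful data plus space for address references."""
    return useful_bytes + pointer_bytes * n_pointers


# Hypothetical workload: two queries and one transaction, durations in seconds.
workload = total_workload_time([2, 3], [1])

# Hypothetical record: 120 bytes of useful data and three 8-byte pointers.
size = record_instance_size(useful_bytes=120, pointer_bytes=8, n_pointers=3)

# Hypothetical store of 1000 logical records at a unit cost of 1.5.
cost = storage_cost(volume=1000, unit_cost=1.5)
```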

5. Conclusion

The problem of constructing an optimal logical structure of a DDB is reduced to a problem of quadratic integer programming. As a result of its solution, the logical structure of the DDB is decomposed into a number of clusters that have minimal information connectivity with each other. Further research will be related to the design of an algorithm for the synthesis of the optimal logical structure of a distributed database. The algorithm will consist of two interrelated stages. At the first stage, the problem of distributing the clusters of the distributed database between a server and clients is solved, followed by the problem of the optimal distribution of the data groups of each node among logical record types. At the second stage, the problem of localizing the database on the nodes of the computer network is solved, and, in addition to the results of the first stage, the characteristics of the distributed database are taken into account.

References

[1] A. V. Vaniachine. LHC Databases on the Grid: Achievements and Open Issues. Proceedings of the IV International Conference on Distributed Computing and Grid-technologies in Science and Education (Grid2010), JINR, Dubna, Russia, 28 June - 3 July, 2010.
[2] Radovan Chytracek, Dirk Düllmann, Giacomo Govi, Alexander Kalkhof, Zsolt Molnár, Andrea Valassi. Distributed Database Access in the LHC Computing Grid with CORAL. IEEE Nuclear Science Symposium Conference Record, 2008.
[3] D. Ciangottini, J. Balcas, M. Mascheroni, E. A. Rupeika, E. Vaandering, H. Riahi, J. M. D. Silva, J. M. Hernandez, S. Belforte, T. T. Ivanov. A comparison of different database technologies for the CMS AsyncStageOut transfer database. J. Phys.: Conf. Ser., 898, 042048, 2017.
[4] V. Kotliar, V. Anshukov, V. Ezhovac, V. Gusev, A. Kotliar, G. Latyshev, A. Shishov. Development of the active monitoring system for the computer center at IHEP. CEUR-WS, ISSN 1613-0073, 2017.
[5] N. A. Kuznetsov. Methods of analysis and synthesis of modular information management systems. M.: Fizmatlit, 800 p., 2012.