Data warehouse access using multi-agent system


Distrib Parallel Databases (2009) 25:
DOI /s

Nader Kolsi, Abdelaziz Abdellatif, Khaled Ghedira

Published online: 21 February 2009
Springer Science+Business Media, LLC 2009

Abstract The approach proposed in this paper deals with the dynamic distribution of the data warehouse (DWH) over a set of servers. This distribution differs from the classical one, which depends on how the data is used: here, data is distributed when a machine reaches its storage capacity limit. The proposed approach ensures scalability and exploits the storage and processing resources available in the organization using the DWH. It is based on a multi-agent model combined with the scalable distribution proposed by the Scalable Distributed Data Structures. Our multi-agent model is made up of stationary agent classes (Client, Dispatcher, Domain and Server) and a mobile agent class (Messenger). These agents collaborate to carry out automatically the storage, splitting, redirection and access operations on the distributed DWH. In this paper, we focus on the global dynamic of the data access operation and present the corresponding experimental results.

Keywords Data warehouse, Dynamic distribution, Data access, Multi-agent system, Mobile agent, Scalable and distributed data structures

Communicated by Ladjel Bellatreche.

N. Kolsi, Higher Institute of Business Administration of Sfax, Sfax, Tunisia, nader.kolsi@fsegs.rnu.tn
A. Abdellatif, University of Sciences of Tunis, Tunis, Tunisia, abdelaziz.abdellatif@fst.rnu.tn
K. Ghedira, National School of Informatics Sciences, Manouba Campus University, Tunis, Tunisia, khaled.ghedira@isg.rnu.tn

1 Introduction

The data warehouse (DWH), as defined by its inventor W.H. Inmon [19], is a collection of data that is subject-oriented, integrated, time-stamped and non-volatile, and that is used to support decision making. It is a repository of data collected from heterogeneous, autonomous and distributed sources, and it is used for analytical tasks in business. The DWH usually contains a very large amount of data, because of the length of the period it must cover (historical data) and the diversity of the sources from which data are extracted.

The DWH is a principal component of the information systems of organizations and is the subject of much research. This research covers five main areas, as shown in [29]: (1) data warehouse modeling and design, (2) data warehouse architectures, (3) data warehouse maintenance, (4) operational issues, and (5) optimization. Our research focuses mainly on operational issues and optimization, but also touches on data warehouse architectures and design.

Our work aims at solving the problems of storage space and performance by: (1) developing a dynamic system that manages the DWH automatically (data storage, data distribution on a set of servers, and data access), (2) taking advantage of the storage and processing resources available in the organization (processors, memory, hard disks, etc.), (3) reducing the data storage time, and (4) improving the query response time.

This paper is organized as follows. Sect. 2 gives an overview of related works and discusses the problems related to optimization. Sect. 3 presents the Scalable and Distributed Data Structures. Sect. 4 presents the multi-agent system. Sect. 5 describes the proposed multi-agent model. Sect. 6 details the global dynamic of the data access operation. Sect. 7 reports the experimental results. Finally, Sect. 8 gives a conclusion and an outlook on future work.
2 Related works

So far, the distribution of data warehouses has not attracted much attention in research. The use of a DWH with a distributed structure appeared only with data marts [11, 18]. Although small data marts (data warehouses) were the first attempt to solve the problems of space and performance, data marts are basically stand-alone and have data integration problems in a global data warehouse context. In addition, the performance of many distributed queries is normally poor, mainly because of load balancing problems. Furthermore, each individual data mart is primarily designed and tuned to answer the queries related to its own subject area, whereas the response to global queries depends on the global system tuning and the network speed. Most of the research on optimization therefore proposes solutions based on a centralized DWH or on a partitioning model, which consists of storing the fact table in pieces, instead of as a large monolithic object, on a set of I/O devices

with a multiprocessor machine or on a centralized database. The latter is very expensive because of the large setup costs, and it is not very flexible because of its centralized nature [5].

In this research, several query optimization techniques are proposed. These techniques can be classified in two categories [2]:

redundant structures, such as materialized views and indexes [3, 15, 22]. These techniques compete for the same resource (storage) and incur maintenance overhead in the presence of updates [28];

non-redundant structures, such as horizontal partitioning [9]. These techniques do not require extra space, unlike those in the first category.

All these techniques are supported by current database management systems (DBMS). The improvements made to these systems for managing large amounts of data are not sufficient to keep up with the growth of the DWH. In addition, the static data fragmentation schema currently used in these systems constitutes a major handicap.

In our approach, we use both kinds of technique mentioned above (non-redundant and redundant structures). Horizontal partitioning is used to distribute the data warehouse on a set of machines. Materialized views and indexes are used on each individual machine, which must be tuned and optimized for performance.

Most of the research on data warehouse distribution proposes solutions based on studies of production databases, under the name of very large databases. These solutions rely on the classic data distribution, which depends on the data use and has a static distribution plan. Furthermore, this type of distribution is defined at the design phase. In [9], the authors propose a solution to make this distribution plan dynamic.
They present an algorithm to find the optimal vertical schema fragmentation based on particle swarm optimization. Other works [30, 31] use abstract state machines [7] as a flexible and quality-oriented formal method to design and optimize a distributed DWH and OLAP (On Line Analytical Processing) applications.

In our approach, the data distribution that we consider differs from the commonly used ones [21]. It is not defined at the design phase; instead, it is imposed by the storage capacity. When a machine reaches its storage capacity limit, we add another one and distribute the data over the two machines to obtain a balanced load.

There are several ways to divide a relation horizontally. Typically, we can assign tuples to the processors in a round-robin fashion (round-robin partitioning), we can use hashing (hash partitioning), or we can assign tuples to the processors by ranges of values (range partitioning) [5]. In [5, 6, 8, 14], the authors use the Data Warehouse Striping (DWS) technique, a round-robin data partitioning approach especially designed for distributed data warehouse environments. With DWS, the fact table is distributed over an arbitrary number of machines, which is fixed at the beginning. Consequently, the queries are executed in parallel by all of the machines [8]. The round-robin distribution is simple to use and guarantees load balancing, but its major disadvantage is that it requires machines with

the same processing and storage capacities. Otherwise, some machines will be too busy and the others will be underused.

In our approach, we use the range partitioning applied by the scalable and distributed data structures (see Sect. 3). The queries are therefore executed in parallel not by all the machines but only by those that contain the necessary partitions. Furthermore, the data distribution is dynamic and automatic: each time a machine reaches its capacity limit, it starts the data distribution operation without needing external intervention (an administrator). Moreover, the number of machines used in our approach is not fixed. The storage capacity of the DWH is therefore theoretically unbounded, because machines can be added dynamically at any moment. In the following section, we present the principle of the scalable and distributed data structures.

3 Scalable and Distributed Data Structures

The Scalable and Distributed Data Structures (SDDS) deal with the storage of a large amount of data on a set of interconnected machines. The SDDS principle consists in distributing the file contents in a way that benefits from the memory available on a set of interconnected machines [4, 10]. This distribution is based on the identifiers (keys): the keys residing on one machine must lie between a lower bound and an upper bound (see Sect. 5.1). As the file content grows, the file is split. This principle has been extended from files to operational databases [24, 26, 27]. The unbounded storage capacity and the dynamic data distribution are guaranteed by the SDDS principle [23].

In the rest of this paper, we use the two terms splitting and distributing interchangeably. In the following section, we present the multi-agent system concepts.
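The three horizontal partitioning schemes mentioned in Sect. 2, and the key bounds used by the SDDS principle, can be illustrated with a minimal Python sketch. The function names and the representation of a range partitioning as a sorted list of inclusive upper bounds (one per machine) are assumptions made for illustration; they are not taken from the paper.

```python
from bisect import bisect_left

def round_robin_machine(row_number, n_machines):
    """Round-robin partitioning: the i-th row goes to machine i mod n."""
    return row_number % n_machines

def hash_machine(key, n_machines):
    """Hash partitioning: the machine is derived from a hash of the key."""
    return hash(key) % n_machines

def range_machine(key, upper_bounds):
    """Range partitioning: upper_bounds[i] is the inclusive upper bound of
    the key interval stored on machine i (the list must be sorted)."""
    i = bisect_left(upper_bounds, key)
    if i == len(upper_bounds):
        raise KeyError(f"key {key!r} is above every upper bound")
    return i

def machines_for_interval(lo, hi, upper_bounds):
    """Machines whose key interval may intersect [lo, hi]. Only these
    machines need to run a query restricted to that interval."""
    first = range_machine(lo, upper_bounds)
    last = range_machine(hi, upper_bounds)
    return list(range(first, last + 1))

if __name__ == "__main__":
    bounds = [50, 99]  # machine 0 holds keys up to 50, machine 1 holds 51..99
    print(machines_for_interval(60, 70, bounds))
    print(machines_for_interval(10, 60, bounds))
```

With round-robin or hash partitioning, a query restricted to a key interval still has to visit every machine, whereas with range partitioning the bounds alone identify the machines to visit.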
4 Multi-agent system

The agent paradigm is currently in vogue in many research domains. An agent can be a physical or virtual entity that acts autonomously (without the direct intervention of humans or others), on behalf of entities (person, organisation, etc.), in response to input from its environment. Agents have a social ability: they may communicate with users, system resources and other agents as required in order to achieve their goals and tendencies. Moreover, more advanced agents may cooperate with other agents to carry out tasks beyond the capability of a single agent. Agents thus contain some level of intelligence, ranging from pre-defined rules up to self-learning artificial intelligence inference machines. This intelligence enables agents to act not only reactively, but sometimes also proactively.

An agent can be static or mobile. A mobile agent is a particular class of agent with the ability, during execution, to migrate dynamically (code, data and execution state) from one machine to another, where it can resume its execution, in order to reach data or

remote resources. It has been suggested that mobile agent technology, amongst other things, can help to reduce network traffic and to overcome network latencies [17]. Moreover, mobile agents have shown high performance both when accessing data distributed on a set of interconnected machines [1] and when storing these data [20].

A multi-agent system (MAS) is a system composed of multiple autonomous agents. It comprises the following elements [13]:

1. An environment E, which is a space that generally has volume.
2. A set of situated objects O, that is to say, it is possible at a given moment to associate any object with a position in E.
3. An assembly of agents A, which are specific objects (a subset of O) representing the active entities in the system.
4. An assembly of relations R, which link objects (and therefore agents) to one another.
5. An assembly of operations Op, which allow the agents of A to perceive, produce, transform, and manipulate objects in O.
6. Operators with the task of representing the application of these operations and the reaction of the world to this attempt at modification, which we shall call the laws of the universe.

The following section presents the data distribution principle and the proposed multi-agent model.

5 Proposed model

The aim of our proposed model is to solve the problems of the DWH context using the resources available in the organization. These problems are related to data storage, splitting and access. In the proposed approach, the DWH is distributed on a set of machines. Data management then requires collaboration and interaction between those machines in order to answer the users' queries while ensuring that these queries are processed in parallel. We have therefore chosen to use a Multi-Agent System (MAS) with mobile agents as essential actors.
The MAS makes it possible to follow the progress of the dynamic data distribution, facilitates the collaboration, interaction and independence of the different machines, and improves the parallel execution of the user queries. The use of mobile agents in the proposed solution is helpful because it allows: (1) decreasing the network load, (2) freeing the client machines during the preparation of the results, which generally takes a long execution time, and (3) above all, securing the data transported over the network (see Sect. 7).

We use the SDDS principle, based on data distribution through intervals (range partitioning), to distribute the data of the DWH on a set of machines. This type of distribution decomposes the DWH into a set of domains. Each domain can be stored on one or more machines according to its data size.


5.1 Principle of data distribution

The DWH is horizontally distributed on a set of machines that have the same DBMS and the same star schema (see Fig. 1). Furthermore, on each machine, we can use materialized views and indexes to tune and optimize the performance. The principle is to start with a single machine for which we define: (1) the storage capacity limit up to which the DBMS in use gives its highest performance (for data access and storage), and (2) both the lower bound and the upper bound for each fact table key. When this machine reaches its limit, we add another one and distribute the data over the two machines to obtain a balanced load. In most cases, it is the fact table that undergoes the splitting operation, because of its large volume. The dimension tables are distributed when their key constitutes a distribution criterion. Otherwise, they are duplicated.

In Table 1, we present a data splitting scenario. Machine 1 starts the first splitting operation when it reaches its storage capacity limit. First, we search for the key value that gives two balanced partitions (e.g. Product_Id, which is a two-digit integer). Then, we move the data belonging to the new interval to machine 2. Finally, we update the intervals. The second splitting operation is launched by machine 2 (e.g. on Date_Id, which is a date). The same process is restarted whenever a machine reaches its capacity limit. The data distribution can continue according to the same criteria or to other ones (Customer_Id, Region_Id).

Note that each SALES table record belongs to only one DWH partition. If each of these DWH partitions is stored in a separate database, we must, on the one hand, split the Date table and the Product table according to the same criteria used for the SALES table.
On the other hand, we duplicate the other tables in order to (1) facilitate the checking of the integrity constraints, (2) ensure the autonomy of the databases, and (3) improve the join time when we access data.

Fig. 1 Distributed data warehouse

Table 1 Splitting scenario

              Start        First splitting         Second splitting                ...
              Machine 1    M1         M2           M1     M2          M3           ...
Customer Id   [A, Z]       [A, Z]     [A, Z]       ...    [A, Z]      [A, Z]       ...
Product Id    [0, 99]      [0, 50]    [51, 99]     ...    [51, 99]    [51, 99]
Region Id     [AA, ZZ]     [AA, ZZ]   [AA, ZZ]     ...    [AA, ZZ]    [AA, ZZ]
Date Id       [Jan, Dec]   [Jan, Dec] [Jan, Dec]   ...    [Jan, Jun]  [Jul, Dec]
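The splitting step of Sect. 5.1 (find the key value that gives two balanced partitions, move the upper interval to a new machine, update the bounds) can be sketched as follows. This is a simplified illustration under stated assumptions: a single integer key, capacity counted in rows, and the hypothetical names `Machine`, `split` and `insert`. The paper's system also handles several keys, dimension tables and agents.

```python
from dataclasses import dataclass, field

@dataclass
class Machine:
    name: str
    low: int          # inclusive lower bound of the key interval
    high: int         # inclusive upper bound of the key interval
    capacity: int     # assumed storage limit, counted in rows
    rows: list = field(default_factory=list)  # stored keys only, for brevity

def split(machine, new_name):
    """Move the upper half of the key interval to a new machine."""
    keys = sorted(machine.rows)
    pivot = keys[(len(keys) - 1) // 2]  # median key: keys <= pivot stay
    if pivot >= machine.high:
        raise ValueError("no balanced split on this key; use another criterion")
    new = Machine(new_name, pivot + 1, machine.high, machine.capacity,
                  [k for k in keys if k > pivot])
    machine.rows = [k for k in keys if k <= pivot]
    machine.high = pivot
    return new

def insert(machines, key):
    """Store a key on the machine whose interval contains it, and split
    that machine when it reaches its capacity limit."""
    target = next(m for m in machines if m.low <= key <= m.high)
    target.rows.append(key)
    if len(target.rows) >= target.capacity:
        machines.append(split(target, f"M{len(machines) + 1}"))

if __name__ == "__main__":
    machines = [Machine("M1", 0, 99, capacity=100)]
    for product_id in range(100):
        insert(machines, product_id)
    for m in machines:
        print(m.name, [m.low, m.high], len(m.rows))
```

The pivot chosen here is the median of the stored keys, so the exact bounds depend on the data; the intervals shown in Table 1 are those of the authors' scenario, not an output of this sketch.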

The following part deals with the architecture of the proposed multi-agent model and the waiting database notion that we use in our approach.

5.2 The proposed multi-agent model

The proposed model consists of five static agent classes (Client, Dispatcher, Splitting, Domain and Server) and a mobile agent class (Messenger). Each agent class is defined by its knowledge (static or dynamic), its acquaintances (the agents that it knows and with which it can communicate), and its behavior [12]. Figure 2 illustrates the interaction between the different agents.

The Client agents act as an interface between the user and the DWH management system (Dispatcher agent). The user uses the Client agent to send data storage and data access operations (queries) to the Dispatcher agent. Each Client agent has the Dispatcher agent as an acquaintance. Its static knowledge is made up of its name and its address. This agent class does not have dynamic knowledge.

The Dispatcher agent arranges the received operations according to their arrival order. These operations are handled by the Messenger agents. When the Dispatcher agent receives the operation results from the Messenger agents, it sends them to the Client agent if the latter is connected. Otherwise, it saves them until the Client agent connects again. The acquaintances of the Dispatcher agent are: (i) the Client agents, which send queries, (ii) the Messenger agents, which take charge of executing these operations, and (iii) the Splitting agent. Its static knowledge consists of its name and its address. Its dynamic knowledge is made up of a list containing all the Domain agents existing in the system and two waiting queues. The first queue stores the operations received from the Client agents. The second one stores the results provided by the Messenger agents.
Then, the Dispatcher agent sends these results to the sending Client agent (as described above).

Fig. 2 The proposed multi-agent model architecture

The Messenger agents take charge of executing each operation found in the operations waiting queue of the Dispatcher agent. Each Messenger agent makes the

execution plan of this operation. Then, it visits all the Domain agents concerned with this operation. Finally, it gives the final results to the Dispatcher agent. Each Messenger agent has as acquaintances the Dispatcher agent and the Domain agents necessary to execute the operation. Its static knowledge is made up of its name and the maximum size of data that it can transport. This maximum depends on the network characteristics. The dynamic knowledge of the Messenger agent consists of: (i) the list of Domain agents to visit for executing the operation, (ii) the operation to execute, (iii) the lists of data to store (if the operation is data storage), or the list of data collected from the visited Domain agents (if the operation is data access), and (iv) the size of the transported data. The Messenger agent has a very important role in our architecture because it allows: (1) reducing the message traffic on the network, (2) accelerating the data storage and access operations, and, above all, (3) securing the data circulation on the network (see Sect. 7).

The Domain agents are responsible for sending the operations to the Server agents that they control. They then collect the replies sent by the Server agents and transmit the final result to the Messenger agent. The Domain agent has as acquaintances: (i) the Server agents that are under its control, (ii) the Messenger agents with which it has operations to execute, and (iii) the Splitting agent. Its static knowledge is composed of its name, its address, the disk space limit of each Server agent, the maximum number of Server agents it can manage and the maximum size of data it can receive from the Messenger agents. This maximum depends on the machine characteristics (memory, processor, etc.). Its dynamic knowledge consists of the descendant list, the size of the memorized data, and two waiting queues. The first queue stores the operations brought by the Messenger agents.
The second one stores the replies sent by the Server agents. Later on, the Domain agent sends them to the appropriate Messenger agent.

The Server agents execute the received operations and send the replies to the Domain agent. Each Server agent has as its acquaintance the Domain agent to which it belongs. Its static knowledge is made up of its name and its address. Its dynamic knowledge is a waiting queue used to store the operations received from the Domain agent.

The Splitting agent is responsible for the splitting operations and for maintaining the data road card that allows finding the data location. The splitting operation is started when a machine reaches its storage capacity limit. The role of this agent consists of the following steps. First, it creates a new Domain agent when it receives a splitting request. Then, it informs the Domain agent that asked for the splitting of the location and the characteristics of the new one. Finally, it sends to the Dispatcher agent the new information concerning the two Domain agents in order to update the Domain agents list. The Splitting agent has as acquaintances the Dispatcher agent and the Domain agents that ask for splitting. Its static knowledge consists of its name and its address. Its dynamic knowledge is the list of splitting requests sent by the Domain agents.

The Dispatcher agent manages a metabase that allows it to follow the evolution of the data distribution on the Domain agents, the network status and the load rate of the Messenger agents (see Fig. 3). This metabase is also used by the Messenger agents to make the execution plans of the received operations and determine the Domain agents


to visit. The Splitting agent also uses this metabase for the splitting operations and updates it at the end of each splitting operation. Furthermore, each Domain agent has its own metabase in order to follow the evolution of the data distribution on its descendants (Server agents) (see Fig. 3, the framed tables).

Fig. 3 Agent metabase tables

In the following section, we detail the dynamic of the proposed model for the data access operation.

6 Multi-agent dynamic for the data access operation

The proposed model is designed to support the different management operations of the data warehouse, namely data storage, splitting, redirection and access. In this paper, we present only the data access operation and we do not consider the case where the system is interrupted.

The sequence diagrams presented later describe both the interactions and the agent behaviors involved in the data access operation. The formalism used to represent these diagrams is MA-UML (Mobile Agent UML) [16], an extension of AUML (Agent UML) that allows modeling mobile agent behaviors. The agents used in this operation are: the Client agents, the Dispatcher agent, the Messenger agents, the Domain agents, and the Server agents. These agents exchange different messages in order to accomplish the data access operation. This exchange is shown in the diagram presented in Fig. 4.

The data access operation is started when the users submit their queries to the Client agents, which send them to the Dispatcher agent. The Client agent is satisfied when it receives a result for each query sent. Otherwise, it sends the query to the Dispatcher agent again or, if the query contains syntax errors, asks the user to correct them.

When receiving the queries, the Dispatcher agent assigns each query to a Messenger agent. If no Messenger agent is available, the Dispatcher agent creates one for

each query. The Dispatcher agent is satisfied when it receives a result for each query. This result is sent to the appropriate Client agent. If the latter is not connected, the Dispatcher agent places the received result in its results queue. The Dispatcher agent is unsatisfied if the Messenger agent informs it that there are syntax errors. In this case, the Dispatcher agent, in turn, informs the Client agent that sent the query.

Fig. 4 Data access operation

The Messenger agent is in charge of the query execution. When receiving the query, it determines the list of Domain agents containing the data that answer the query. The Messenger agent uses the information available in the metabase and the WHERE clause of the query to determine these agents and their addresses. If this clause does not exist, the list contains all the Domain agents in the system. The Messenger agent clones itself as many times as there are Domain agents to visit. Each cloned Messenger agent moves to one of the selected Domain agents. When it receives the reply from the visited agent, it returns to the original Messenger agent, sends it the partial result of the query and kills itself. The cloned Messenger agent is satisfied when it receives the reply from the visited Domain agent.

If the query has a GROUP BY clause and/or an ORDER BY clause, the original Messenger agent creates a temporary table, corresponding to the query, to save the received partial results. When it has received all the partial results, the original Messenger agent executes the query on the temporary table to get the final result, sends it to the Dispatcher agent and drops the table. If the query has neither of these two clauses, the original Messenger agent gathers the partial results and then sends the final result to the Dispatcher agent. The original Messenger agent is satisfied when all the cloned Messenger agents have returned with the partial results.
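The Messenger behavior described above (select the Domain agents from the metabase and the WHERE clause, visit them in parallel, store the partial results in a temporary table, run the final query, drop the table) can be imitated with the Python standard library. This is a sketch under explicit assumptions: each Domain is an in-memory SQLite database, cloned Messenger agents are simulated by threads, the metabase is a dictionary of product_id intervals, and all names are hypothetical. The authors' prototype uses Oracle and IBM Aglets.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def make_domain(rows):
    """A Domain holding one partition of the SALES fact table."""
    db = sqlite3.connect(":memory:", check_same_thread=False)
    db.execute("CREATE TABLE sales (product_id INTEGER, region_id TEXT, sale_qt INTEGER)")
    db.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return db

def select_domains(metabase, lo, hi):
    """Keep the domains whose product_id interval intersects [lo, hi]."""
    return [name for name, (dlo, dhi) in metabase.items() if dlo <= hi and lo <= dhi]

def clone_visit(db, lo, hi):
    """Work done by one cloned Messenger on the visited Domain."""
    return db.execute(
        "SELECT region_id, SUM(sale_qt) FROM sales "
        "WHERE product_id BETWEEN ? AND ? GROUP BY region_id", (lo, hi)).fetchall()

def messenger(domains, metabase, lo, hi):
    targets = select_domains(metabase, lo, hi)
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        partials = list(pool.map(lambda n: clone_visit(domains[n], lo, hi), targets))
    tmp = sqlite3.connect(":memory:")
    tmp.execute("CREATE TEMP TABLE partial (region_id TEXT, s INTEGER)")
    for part in partials:
        tmp.executemany("INSERT INTO partial VALUES (?, ?)", part)
    result = tmp.execute(
        "SELECT region_id, SUM(s) FROM partial "
        "GROUP BY region_id ORDER BY region_id").fetchall()
    tmp.execute("DROP TABLE partial")
    tmp.close()
    return targets, result

if __name__ == "__main__":
    metabase = {"d1": (0, 50), "d2": (51, 99)}
    domains = {"d1": make_domain([(10, "AA", 5), (40, "BB", 3)]),
               "d2": make_domain([(60, "AA", 2), (90, "BB", 4)])}
    print(messenger(domains, metabase, 30, 70))
    print(messenger(domains, metabase, 0, 20))
```

The second call visits only one domain, which is the pruning effect that range partitioning makes possible.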
When receiving the query from the cloned Messenger agent, the Domain agent verifies whether the data requested by the received query belongs to the Server agents


which are under its responsibility. If this condition is true, the Domain agent sends the query to the appropriate Server agents. Otherwise, the Domain agent forwards the query to the right Domain agent. This last case occurs when a splitting operation has happened before the query arrives. The Domain agent is satisfied when it has received the results from all the Server agents. These results are sent to the cloned Messenger agent.

The Server agent executes the query and sends the obtained result to the responsible Domain agent. It is satisfied when it has replied to all the received queries. If there are syntax errors, the Server agent is unsatisfied and informs the Domain agent.

In the following section, we present the results obtained for the data access operation.

7 Experimental evaluation

In order to validate our model for the data access operation, we have implemented three prototypes and measured the query execution time. One of them accesses data on a centralized database (DB). The others access data on a set of machines. As described below, we ran the experiments using one machine that sends the query and N (three, then five) machines that contain the DWH partitions. These machines have the same configuration: P4 and 256 MB (RAM). We used JDeveloper10g as the development toolkit, Oracle as the DBMS, and IBM Aglets as the multi-agent platform. We programmed an engine that inserts the data in the DWH partitions.

In the first prototype, we used two machines (client/server) and programmed an engine that accesses the data stored on the server machine (centralized DWH) from the client machine.

In the second prototype, we programmed an access engine, without MAS, that accesses the data distributed on a set of machines (three, then five machines) using database links. Each machine contains 1/N of the data size used. The given results (see Figs.
5 and 6) illustrate the aggregate functions (count, sum, avg, max, and min) with this type of query:

Fig. 5 Experimental results with data size = 600 MB without Group by/order by


Fig. 6 Experimental results with data size = 2.1 GB without Group by/order by

Fig. 7 Experimental results with data size = 600 MB with Group by/order by

    Select aggregate_function (s)
    From ((Select aggregate_function (sale_qt) s From Sales@dwh1)
          Union all (...)
          ...
          Union all (Select aggregate_function (sale_qt) s From Sales@dwhN));

In the last prototype, we programmed the MAS dynamic (see Sect. 6). In this prototype, the machines are used as follows: (1) on one of these machines, we placed the Dispatcher agent, the metabase (MB), the Client agent and the Messenger agents, and (2) on each of the other N machines, we placed a Domain agent, a partition of the DWH database (DWHi) containing 1/N of the data size used, an MB and a Server agent. The query type used to get the given results (see Figs. 5 and 6) is:

    Select aggregate_function (sale_qt) From Sales;

We tested these prototypes using different data sizes: records equivalent to 600 MB and records equivalent to 2.1 GB. We also tested our model (see Figs. 7 and 8) using this type of query:

    Select... Group by region_id Order by region_id
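When partial aggregates computed on each partition are combined, as in the Union all query above, the outer function is not always the same as the inner one. Sum, max and min can be reapplied to the partial values, but a global count is the sum of the partial counts, and a global average needs the partial sums and counts rather than the partial averages. The sketch below illustrates this general property; the function names are hypothetical and nothing here is taken from the authors' access engine.

```python
def partial_aggregate(values):
    """Aggregates computed locally on one non-empty partition."""
    return {"count": len(values), "sum": sum(values),
            "min": min(values), "max": max(values)}

def combine(partials):
    """Merge the per-partition aggregates into global aggregates."""
    count = sum(p["count"] for p in partials)   # sum of counts, not count of counts
    total = sum(p["sum"] for p in partials)
    return {"count": count,
            "sum": total,
            "avg": total / count,               # not the average of the averages
            "min": min(p["min"] for p in partials),
            "max": max(p["max"] for p in partials)}

if __name__ == "__main__":
    partitions = [[1, 2, 3], [10]]
    print(combine([partial_aggregate(p) for p in partitions]))
```

With the partitions [1, 2, 3] and [10], the global average is 16 / 4 = 4.0, whereas averaging the two partial averages (2 and 10) would give 6.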


Fig. 8 Experimental results with data size = 2.1 GB with Group by/order by

Table 2 Average gain percentages compared to the centralized DWH
Columns: query without group by/order by (data size = 600 MB, data size = 2.1 GB); query with group by/order by (data size = 600 MB, data size = 2.1 GB)
Rows: MAS; MAS; Distributed DWH without MAS 3; Distributed DWH without MAS 5

In Table 2, we present the average gain percentages obtained when comparing the distributed prototypes to the centralized prototype. When we compare the query execution time of the prototype using a distributed DWH without MAS to that of the prototype using a centralized DWH, we observe that, in most cases, the average gains are positive. This is explained by the facts that: (1) the query accesses only a small part of the fact table, and (2) the query is executed on the parts of the fact table in parallel. These averages turn negative when a small data size is distributed on a set of machines. In these cases, the data load time becomes a significant part of the query execution time.

We note that the average gains given by our model are the best. These gains result from reducing: (1) the network load (the Messenger agent encapsulates the partial result) and (2) the communications between machines (each machine executes the query locally). In Table 3, we give the time needed to execute each query step on each machine used, in order to demonstrate these gains. We take as examples the Avg query and the ALL functions query using 3 machines and data size = 2.1 GB.
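The paper does not state how the gain percentage is computed. A common definition, given here only as an assumption, compares the execution time $T_{c}$ on the centralized DWH with the execution time $T_{d}$ of a distributed prototype for the same query:

```latex
G = \frac{T_{c} - T_{d}}{T_{c}} \times 100
```

Under this assumed definition, $G$ is positive when the distributed prototype is faster than the centralized one and negative when it is slower, which matches the sign convention used in the discussion of Table 2.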

Table 3 Execution time by query step and by machine
Columns: Avg query (M1, M2, M3); ALL functions query (M1, M2, M3)
(a) Without group by and order by, time in ms. Rows: Query execution (ServA); Tuple transmission (MessA); MAS coordination; Execution time (DomA); Total execution time
(b) With group by and order by, time in ms. Rows: Query execution (ServA); Data grouping and sorting (ServA); Tuple transmission (MessA); Insertion of partial results in temporary table (MessA); MAS coordination; Execution time (DomA); Creation of the temporary table (MessA); Building the final result (MessA); Total execution time
ServA = Server Agent, MessA = Messenger Agent, DomA = Domain Agent

We note that the time needed to execute the query on each machine is equal to the time needed to execute the query on a centralized DWH (AVG query (a) = ms, AVG query (b) = ms, ALL query (a) = ms, ALL query (b) = ms) divided by three. In addition, the time required to transmit tuples, to coordinate the MAS and to build the final result increases slightly when the number of returned tuples increases. This time is approximately 6500 ms. For the query without group by and order by clauses, distributing the data on 5 machines reduces the query execution time by about 1000 ms on average. For the query with group by and order by clauses, the query execution time is reduced by about 4500 ms on average. For both query types, the time needed to coordinate the MAS and to build the final result increases by approximately 2500 ms. This is why the times obtained for the same queries, when using 5 machines, are as follows: AVG query (a) = ms, AVG query (b) = ms, ALL query (a) = ms, ALL query (b) = ms.

Our model not only gives the best access time but also secures the circulation of data on the network. We have written a function that the cloned Messenger agent executes each time it reaches a machine. This function lets the cloned Messenger agent check whether the address of the reached machine belongs to its address list. If the address is not found, the cloned Messenger agent tries to leave the machine. If it cannot leave, it destroys the transported data and kills itself.

8 Conclusion

In this article, we have presented research that deals with data distribution in the DWH context and with multi-agent systems. We have then described our proposed multi-agent model and its global dynamic for the data access operation. Finally, we have demonstrated the improvements obtained by using the MAS and the Messenger agents in the data access operation. We can conclude that when the number of machines used increases, the average gains given by our model increase. However, the useful number of machines depends on the data size and the query complexity. If a small amount of data is distributed over a large number of machines, the circulation time of the cloned Messenger agents becomes sizeable and access to a centralized DWH is more efficient. These results will be used to perform the data splitting operation. For each query, we estimate the execution time if we distribute the data on two machines.
If this estimate is lower than the time obtained when the data are centralized, we split the data. As near-future work, we will test our model with benchmarks (TPC-H and APB-1) and compare the results to those reported in the literature. We will also implement the query redirection process. Another future direction is to study how to make our system robust enough to deal with the momentary unavailability of one or more machines.
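The splitting criterion stated in the conclusion (estimate the execution time on two machines and split when it is lower than the centralized time) can be sketched as follows. This is a simplified model under stated assumptions, not the authors' estimator: it assumes that the query time on each machine is the centralized time divided by the number of machines, as observed in the experiments, plus a fixed overhead for tuple transmission, MAS coordination and building the final result. The function names are hypothetical, and the overhead value must be measured for the target configuration (the text reports roughly 6500 ms with 3 machines).

```python
def estimate_distributed_ms(centralized_ms: float, machines: int,
                            overhead_ms: float) -> float:
    """Assumed model: parallel query time plus a fixed coordination overhead."""
    if machines < 1:
        raise ValueError("at least one machine is required")
    return centralized_ms / machines + overhead_ms


def should_split(centralized_ms: float, overhead_ms: float,
                 machines: int = 2) -> bool:
    """Split only if the estimated distributed time beats the centralized time."""
    estimate = estimate_distributed_ms(centralized_ms, machines, overhead_ms)
    return estimate < centralized_ms
```

With a hypothetical overhead of 6500 ms and two machines, this model favours splitting only when the centralized query takes more than 13000 ms, which matches the observation that a small amount of data is served more efficiently by a centralized DWH.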
