Simulation of a cost model response requests for replication in data grid environment


Benatiallah Ali, Kaddi Mohammed, Benatiallah Djelloul, Harrouz Abdelkader
Laboratoire LEESI, Faculté des Sciences et Technologie, Université d'Adrar, Algérie

Abstract

Data grids raise new challenges, such as the heterogeneity and availability of geographically distributed resources, fast data access, latency minimization and fault tolerance. Researchers interested in this technology address problems such as task scheduling, load balancing and replication. Replication is an effective way to improve data access performance, the use of grid resources and data availability. In a system with replication, a coherence protocol imposes some degree of synchronization between the copies and some order on updates. In this work, we present a replica placement approach that minimizes the response cost of read and write requests, and we implement our model in a simulation environment. The placement techniques are based on a cost model that depends on several factors, such as bandwidth, data size and the storage capacity of nodes.

Key Words: response time, query, consistency, bandwidth, storage capacity, CERN.

University of Nizwa, Oman, December 9-11, 2014

1. INTRODUCTION

Data availability is a critical issue for every organization. Information changes constantly, and the Internet generates ever larger and more complex data streams. This has led the scientific community to rethink the technologies used to store, access and process data. Advances in telecommunications have made it possible to interconnect a multitude of geographically distributed computers so that they cooperate; this was the birth of computing grids. Such volumes of data can no longer be stored on a single machine, so a data grid is often used instead. Grid computing is an important mechanism for managing IT resources located at remote sites and linked to their consumers through data transmission networks. The idea is to link heterogeneous, distributed storage resources so that, for the user, they appear as a single entity [3].

One of the first reasons to use data grids [4, 6] comes from applications that use large data sets [1, 8], for example in high-energy physics [7] or the life sciences [2]. However, the distribution and large scale of a data grid and the dynamicity of its sites pose, respectively, the problems of remote access and data availability. These parameters are extremely important in data grids, where access cost must be kept very low and user access is frequent.

Replication is an important technique for accessing shared data. The cost of data access directly influences the response time seen by the client. One problem in using replication is choosing the replica so as to minimize the response cost of user queries. Our goal is to propose an approach, called "good customer + nearest common point", that minimizes the query response cost for replication in a data grid environment with a hierarchical, CERN-type topology.

2. MATERIALS AND METHODS

2.1 Topology of the grid:

Our aim is to provide a cost model for replication in a data grid. We chose a CERN-type grid topology for several reasons:

- CERN is a real, widely used grid.
- CERN is hierarchical.
- The design of CERN is simple.

- The number of levels in CERN is fixed (five levels).

The CERN structure is illustrated in Figure 2.1.

Figure 2.1 Logical topology of the grid used by CERN [5]

2.2 The parameters used in our cost model:

The proposed approach treats replication as a replica placement optimization problem: it minimizes the average response cost of the queries generated by customers in a data grid, based on the following parameters:

- Pn: the immediate parent of a node n
- Rd: the set of nodes that contain a replica of the data d
- BW(n): bandwidth between node n and its parent Pn
- Size(d): size of the data d
- Path_d(n1, n2): all nodes encountered on the way from node n1 to node n2, except node n2
- CT_d(n1, n2): cost of transferring the data d from node n1 to node n2
- CTL_d: cost of processing a read operation on the data d
- CTE_d: cost of processing a write operation on the data d
- CRR_i: response cost of query i
- CM: average cost of all queries

When a customer requests a data item, it is served by the closest node containing a replica of that data. The response cost of a request is the sum of the cost of transferring the data and the cost of processing the operation (read or write). The cost of transferring the data d from node n1 to node n2 is calculated as follows:

$$CT_d(n_1, n_2) = \sum_{n \in Path_d(n_1, n_2)} \frac{Size(d)}{BW(n)} \tag{2.1}$$

Because our simulation is static, the cost of processing a read operation (CTL_d), the cost of processing a write operation (CTE_d) and the size of the data (Size(d)) are set by the simulator and can be changed. We also assume that all nodes at the same level have the same storage capacity, and that the bandwidth between the nodes of level n and the nodes of level n+1 is fixed. The simulator lets users enter the storage capacity of the nodes and the bandwidth between one node and another.

The response cost of a request i from a client (node) n1 that reads the data d held on a node n2 is therefore:

$$CRR_i = CT_d(n_1, n_2) + CTL_d \tag{2.2}$$

Likewise, the response cost of a request i from a client (node) n1 that writes the data d held on a node n2 is:

$$CRR_i = CT_d(n_1, n_2) + CTE_d \tag{2.3}$$

The average cost over all N queries is then:

$$CM = \frac{1}{N} \sum_{i=1}^{N} CRR_i \tag{2.4}$$

All costs (CT_d(n1, n2), CRR_i, CM, ...) are expressed in units of time.

2.3 Phases of handling our simulator

Our simulator reproduces the components of an actual grid. A grid can thus be modeled by performing the following steps:

- Configure the grid: The storage capacity of the grid nodes and the bandwidth between these nodes are either entered by the user or left at their default values.
- Placement of data without replication: After configuring the grid, we place the data on the grid nodes. The "data" entities can be set by default or by the user. Queries are then generated by the user or randomly, and their response cost is calculated.
- Data replication: This phase replicates the data already placed on the grid, reuses the queries generated in the placement-without-replication phase, calculates the costs and displays the results. The simulator can replicate data randomly or let the user choose what to replicate. The replication rules are the same as the placement rules, except that each data item can be replicated on multiple nodes. No new queries are generated after replication; the queries generated and saved in the placement phase are reused so that the results of the two phases can be compared. The last step of this phase displays the data paths, the costs and the results of the different queries.
- Bringing data closer to the good customer: This phase moves each data item closer to its good customer. The good customer(s) of a data item is the customer(s) that issues the largest number of requests for it relative to the other clients that request the same data. The queries generated previously are reused to select the good customers of each data item. Bringing a data item closer to a good customer means creating a copy of the data on the working-group node (immediate parent) to which that customer belongs. We then calculate the cost of the queries and show the results. Our approach brings the data closer not only to the good customers but also to the nearest common point of the customers that request the same data. To do this, we use the previously generated queries to identify, for each data item, the customers that read or write it. After this step, we calculate the cost of the queries and show the results alongside those of the previous phases, so that all approaches can be compared.
- Bringing data closer to the good customer + the nearest common point: The average cost of the generated queries depends on the number and placement of the data on the grid. This phase displays the average cost of placement without replication, with replication, with data brought closer to the good customer, and with data brought closer simultaneously to the good customer and to the nearest common point. Displaying the average cost of our approach beside the other placement methods lets us show that our approach gives, in most cases, the minimum average cost.
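The cost model of Section 2.2 and the placement phases above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' simulator, and it rests on several assumptions: equation (2.1) is read as summing Size(d) / BW(n) over the links crossed, each link being named by its lower (child) node; the "closest" replica is taken to be the one with the cheapest transfer; and the "nearest common point" is interpreted as the lowest node that is an ancestor of every requesting client. All function names, the toy seven-node grid and its bandwidth values are hypothetical.

```python
from collections import Counter

def ancestors(parent, n):
    """Return the chain [n, parent(n), ..., root]."""
    chain = []
    while n is not None:
        chain.append(n)
        n = parent[n]
    return chain

def links_between(parent, n1, n2):
    """Links on the tree path n1 -> n2, each named by its lower (child) node."""
    up1, up2 = ancestors(parent, n1), ancestors(parent, n2)
    shared = set(up1) & set(up2)
    return [n for n in up1 if n not in shared] + [n for n in up2 if n not in shared]

def transfer_cost(parent, bw, size, n1, n2):
    """Assumed reading of eq. (2.1): sum of Size(d) / BW(n) over the path."""
    return sum(size / bw[n] for n in links_between(parent, n1, n2))

def response_cost(parent, bw, size, client, replicas, op_cost):
    """Eqs. (2.2)/(2.3): cheapest transfer from any replica plus processing cost."""
    cheapest = min(transfer_cost(parent, bw, size, client, r) for r in replicas)
    return cheapest + op_cost

def average_cost(parent, bw, size, queries, replicas, op_cost):
    """Eq. (2.4): mean response cost; each query is the name of its client."""
    total = sum(response_cost(parent, bw, size, c, replicas, op_cost) for c in queries)
    return total / len(queries)

def good_customer_parent(parent, queries):
    """Immediate parent of the client that issued the most requests."""
    best_client, _ = Counter(queries).most_common(1)[0]
    return parent[best_client]

def nearest_common_point(parent, clients):
    """Assumed interpretation: lowest node that is an ancestor of all clients."""
    chains = [ancestors(parent, c) for c in clients]
    shared = set(chains[0]).intersection(*chains[1:])
    return next(n for n in chains[0] if n in shared)

# Hypothetical toy grid: root -> t1 -> {r1 -> {c1, c2}, r2 -> {c3}}.
parent = {"root": None, "t1": "root", "r1": "t1", "r2": "t1",
          "c1": "r1", "c2": "r1", "c3": "r2"}
# bw[n] is the bandwidth of the link between n and its parent (made-up values).
bw = {"t1": 20.0, "r1": 10.0, "r2": 10.0, "c1": 5.0, "c2": 5.0, "c3": 5.0}
size, read_cost = 100.0, 1.0
queries = ["c1", "c1", "c1", "c2", "c3"]  # read requests; c1 is the good customer

replicas = {"root"}  # the data initially lives on the root only
no_replication = average_cost(parent, bw, size, queries, replicas, read_cost)

replicas = replicas | {good_customer_parent(parent, queries)}
good_customer = average_cost(parent, bw, size, queries, replicas, read_cost)

replicas = replicas | {nearest_common_point(parent, sorted(set(queries)))}
with_common = average_cost(parent, bw, size, queries, replicas, read_cost)

print(no_replication, good_customer, with_common)  # 36.0 24.0 23.0
```

On this toy grid the extra replica at the common point only helps the client c3, which sits outside the good customer's working group. Write requests would use the same functions with CTE_d passed as `op_cost`; the cost of propagating updates between replicas is not modeled here.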

3. Results

To demonstrate the effectiveness of our approach, we chose a sample application, which gave the following result:

Figure 3.1 The average cost of each approach in our example (chart title: Simulation results; x-axis: Approach; y-axis: Cost in time units)

For this example, the result shows that our "good customer + nearest common point" approach gives the minimum average cost.

After multiple tests, we concluded that when the grid is large, holds several replicas, and receives a large number of requests from clients that are widely distributed on the grid, the average cost of our approach is smaller than that of the other approaches (the difference between the average cost of our approach and that of the good customer approach reaches up to 25%).

4. Conclusion

Our simulation results show that our approach (good customer + nearest common point) reduces the response cost of a query compared with the other approaches (without replication, with replication, good customer), because our data placement strategy gives priority to good customers without penalizing the other clients that request the same data. In future work, we expect to refine our solution to make it more flexible. Several axes can be identified:

- Take into account, in the calculation of the response cost, the size of the request made by a customer on a given data item and the cost of propagating updates of the data to the other nodes.
- Use protocols to maintain the consistency of replicated data.
- Take into account the computing capacity of each site in replica placement decisions: it is not advantageous to place a replica of a heavily requested data item on a site with limited capacity.
- Study in parallel the problem of the number of replicas that a site can host, based on its available storage space and computing capacity.
- Run more tests with other designs and configurations (varying the number of nodes in each level of the grid, the storage capacity of the nodes, the bandwidth, etc.), which may reveal other contributions of the proposed strategy.
- Implement our model in another simulation environment.
- Implement this approach on a real grid.

References

[1] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury and S. Tuecke. The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 23.
[2] A. Krishnan. A Survey of Life Sciences Applications on the Grid. New Generation Computing, 22.
[3] A. Vernois. Ordonnancement et réplication de données bioinformatiques dans un contexte de grille de calcul. PhD thesis, École Normale Supérieure de Lyon, Parallel Computing laboratory.
[4] F. Berman, G.C. Fox and A.J.H. Hey, editors. Grid Computing: Making the Global Infrastructure a Reality. Wiley.
[5]
[6] I. Foster and C. Kesselman, editors. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann.
[7] M. Karlsson and M. Mahalingam. Do we need replica placement algorithms in content delivery networks? In Proceedings of the
[8] X. Qin and H. Jiang. Data Grid: Supporting Data-Intensive Applications in Wide-Area Networks. Technical report TR, University of Nebraska-Lincoln, Lincoln, NE, USA, May.

The International Arab Conference on Information Technology (ACIT2014)
