
Proceedings of the IASTED International Conference Software Engineering and Applications (SEA'99)
October 6-8, 1999, Scottsdale, Arizona, USA

AN OBJECT-ORIENTED APPROACH FOR MANAGING A NETWORK OF DATABASES

Shu-Ching Chen
School of Computer Science, Florida International University, Miami, FL 33199

Mei-Ling Shyu
School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907

Chi-Min Shu
Department of Environmental Safety Engineering, National Yunlin University of Science and Technology, Yunlin, Taiwan, R.O.C.

ABSTRACT

A large scale network may consist of hundreds of disparate and autonomous databases. Users in such an information-providing environment usually access information from databases in the same or similar application domains, so there is no need to handle all the entities from all the databases. That is, in most cases, only a subset of the databases is required for the users' requests. As the number of databases increases, the need to manage such a network of databases increases. In this paper, we present a split/cluster approach using the object-oriented technique to allow users to incrementally and dynamically access the information they want without being overwhelmed by all of the unstructured information. The approach is based on the affinity relationships of the databases and is performed recursively to split these databases into clusters. A cluster hierarchy is then formed to provide different levels of abstraction for the users. This framework provides a flexible means of sharing information among all the databases. Theoretical terms along with a running example are presented.

Key words: Object-oriented databases, affinity, clustering, splitting.

1. INTRODUCTION

Integrating heterogeneous databases is a challenging problem since incompatibilities exist among all the databases [3]. To provide a database schema that is as transparent as possible, conflicts need to be resolved before a view can be provided to the users.
A number of researchers have investigated the problem of integrating heterogeneous databases [1] [2] [5]. However, the issues of conflict resolution are not discussed in this paper, since we focus on managing the network of databases to help users better utilize the information in the databases. In such a large scale database network, queries tend to traverse data that are related to the same or similar application domains and that reside in different databases. Most of the queries request information from a small fraction of the databases in the network, without the need to show all the entities of all the databases. This motivates us to split the network of databases recursively into beneficial clusters based on the access behavior of application queries.

An example of grouping close to our concept is the Internet [4]. The Internet is a computer network consisting of several connected subnetworks. Every subnetwork follows its own communication protocols and is usually set up to serve some special purposes. One difference between these two concepts is that all the subnetworks provide almost the same set of information, while each cluster in the proposed approach can provide diverse sets of information.

In this paper, an object-oriented split/cluster approach is proposed. The object-oriented paradigm is adopted since things in the world around us have properties or features; we can think of data as an object class with its defining attributes. The affinity measures between every pair of databases are formalized and calculated based on the access behavior of application queries. Each query may be activated several times, and hence each query has its access frequency. Therefore, the access frequency of a query per time period should be taken into account in the affinity measures. The splitting procedure is based on the affinity relationships of the databases and is performed recursively to split these databases into clusters. After the split/cluster step, a cluster hierarchy is generated.
The cluster hierarchy provides different levels of abstraction and hence allows users to incrementally and dynamically access the pieces of information they want without being overwhelmed by all of the unstructured information. The constructed clusters can be used as the unit not only for query processing but also for discovering object-oriented relationships such as superclass, subclass, and equivalence relationships, which is the subject of a forthcoming paper. Users who wish to access only parts of the databases can access the data from the appropriate clusters without going through the whole network of databases. In other words, the proposed approach provides a flexible means of sharing information among all the databases.

This paper is organized as follows. In Section 2, the proposed object-oriented approach, with the relative affinity formulations and the split/cluster procedure, is introduced. A simple example is given in Section 3 to illustrate the steps of the split/cluster procedure. Section 4 concludes this paper.

2. PROPOSED OBJECT-ORIENTED APPROACH

A set of historical queries issued to the databases in the network is used as a priori information for the split/cluster procedure. We use the relative affinity values to measure how frequently two databases have been accessed together in the set of historical queries. Realistically, it cannot be expected that the user applications are able to specify these affinity values, and hence formulas need to be defined.

2.1. Relative Affinity Measures

Let Q = {q_1, q_2, ..., q_q} be the set of queries that run on the set of databases D = {d_1, d_2, ..., d_d} in the large scale database environment. Define the variables:

use_i = a vector of length q indicating the usage patterns of d_i with respect to all the queries in Q. For each database d_i, the kth entry of use_i denotes the usage pattern of d_i with respect to q_k and is defined as follows:

\[ use_i(q_k) = \begin{cases} 1 & \text{if object classes in } d_i \text{ are accessed by } q_k \\ 0 & \text{otherwise} \end{cases} \]

access = a vector of length q indicating the access frequencies of the queries in Q per time period. The kth entry of access denotes the access frequency of query q_k.

rel(i, j) = the affinity value of databases d_i and d_j, defined as

\[ rel(i, j) = \sum_{k=1}^{q} use_i(q_k) \, use_j(q_k) \, access(q_k) \]

M = a matrix of size g x g indicating the affinity measures of the databases in a group DB_GROUP_IJ with respect to all queries in Q, assuming DB_GROUP_IJ = {d_1, d_2, ..., d_g}. The rel(i, j) value is placed at the (i,j)th entry of M.
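As a rough illustration of the rel(i, j) formula, the following Python sketch builds an affinity matrix from use vectors and access frequencies. The function name `affinity_matrix` and the small data set (three databases, four queries, made-up frequencies) are assumptions chosen for illustration; they are not taken from the paper.

```python
from typing import List, Sequence


def affinity_matrix(use: Sequence[Sequence[int]],
                    access: Sequence[float]) -> List[List[float]]:
    """Return M with M[i][j] = sum_k use[i][k] * use[j][k] * access[k]."""
    n = len(use)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Only queries that touch both databases contribute.
            m[i][j] = sum(ui * uj * a
                          for ui, uj, a in zip(use[i], use[j], access))
    return m


# Hypothetical data: 3 databases, 4 queries (not the paper's example).
use = [
    [1, 0, 1, 0],  # database 1 is accessed by q1 and q3
    [1, 1, 0, 0],  # database 2 is accessed by q1 and q2
    [0, 0, 1, 1],  # database 3 is accessed by q3 and q4
]
access = [10, 20, 30, 40]  # assumed access frequency per query

M = affinity_matrix(use, access)
for row in M:
    print(row)
```

The sketch fills the whole matrix, including the diagonal, purely for simplicity; only the off-diagonal entries matter for comparing pairs of databases.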
Note that M is a symmetric matrix: the (i,j)th entry has the same value as the (j,i)th entry, and the (i,i)th entries are not used in the split/cluster procedure. For simplicity, only one entry of each symmetric pair therefore needs to be computed.

PP(i,j) = a closeness difference function which calculates the closeness difference between column i for d_i and column j for d_j. Let O represent a temporary matrix in the split/cluster procedure which contains the first several columns of M. For every possible pair of neighbors d_i and d_j, PP is defined as follows:

PP(i,j) = M(1,i) - O(1,j)   if d_i is put to the left of d_j
PP(j,i) = O(1,j) - M(1,i)   if d_i is put to the right of d_j

2.2. Split/Cluster Procedure

The objective of the split/cluster procedure is to find several clusters of databases that are accessed together more frequently by the set of queries. For a large scale database environment, this split/cluster procedure should be invoked iteratively to form the cluster hierarchy. The split/cluster procedure takes the primary data as inputs, computes the entries of the matrix M, calculates the closeness difference values, permutes its columns, and then generates an updated matrix M. The function PP(i,j) is defined to calculate the difference of two affinity values of nearby neighbors for each possible position (d_i, d_j), based on the entries in M. The permutation is done by considering the minimum of the PP values for each database.

The PP(i,j) function is designed to be the closeness difference for two columns i and j. Let column i be the one that needs to be placed in the temporary matrix O, where O consists of the first several columns of matrix M. Column i can be placed on the left or right of column j in O. The main idea is to position column i in the place which satisfies two conditions: its affinity measure should be less than or equal to the affinity measure of its left neighbor and greater than or equal to the affinity measure of its right neighbor.
For the leftmost or the rightmost position of O, only one of the above two conditions is considered, because the column has only one neighbor in such cases. For each closeness difference value, check whether it is less than zero. If so, ignore this possible position, since a negative difference means the required conditions are not satisfied.

Since the procedure computes only the closeness differences of nearby neighbors and considers the minimum of the differences, it tends to partition the matrix M into two clusters: one in the upper left corner and the other in the lower right corner. In general, the border for the split is not very clear-cut. For this purpose, a splitting phase is proposed to decide the split point. The splitting phase compares the mean value of the first column with each individual value in that column of M. If the individual value is greater than or equal to the mean value, then it belongs to the upper left corner group. Otherwise, it belongs to the lower right corner group. Two clusters can therefore be generated at each iteration. The mean value of the first column is chosen as the splitting criterion since the first column tends to have the larger affinity values.

However, there must be some stopping criteria to end the iterations. There are two stopping criteria for each split/cluster procedure iteration: (1) when the size of a cluster is one, i.e., the number of databases in the cluster is one, and (2) when the size of a cluster is less than four. If either condition is satisfied, there is no more splitting for that cluster; in particular, it makes no sense to split a cluster with only one element. Otherwise, each cluster executes the split/cluster procedure iteratively until one of the conditions is met. Initially, the split/cluster procedure is applied to all the databases in the network. The procedure is iterated until no more splitting is permitted.

Steps for the split/cluster procedure:

1. Preparation of the primary data: The primary data required are access and use_i, where i = 1, ..., d (d is the total number of databases). These vectors are given a priori from a set of historical queries. However, since the application queries issued to the databases can be recorded per time period (say monthly or annually), the required data can be updated accordingly.

2. Computation of the entries in M:

\[ rel(i, j) = \sum_{k=1}^{q} use_i(q_k) \, use_j(q_k) \, access(q_k), \quad i, j = 1, \ldots, d \]

3. Determination of the cluster size: Each cluster in the cluster hierarchy is an input to the split/cluster procedure. The size of a cluster (g) is the number of databases in the cluster. Initially, g = d since the input cluster consists of all the databases in the network. Assuming DB_GROUP_IJ = {d_1, d_2, ..., d_g} with g <= d, the size of DB_GROUP_IJ is g.

4.
    for loop1 = 1 to g-1
        /* initialization of the matrix O */
        /* place the first loop1+1 columns of M into O */
        O(*, 1) = M(*, 1); O(*, 2) = M(*, 2); ... O(*, loop1+1) = M(*, loop1+1);
        for loop2 = loop1+2 to g
            For each column in the remaining g-(loop1+1) columns, calculate a PP vector
            for the loop2 possible positions for that column.
            Select the position by the minimum PP value.   /* position selections */
            Once the position for the column is determined, permute the columns in O if
            necessary. Place the column into its corresponding position in O.
                                                           /* column permutations */
        end
        Once the positions for the remaining g-(loop1+1) columns are determined, the
        permutations of the corresponding rows are performed so that the relative
        positions in O are maintained.                     /* row permutations */
        M = O                                              /* update M */
    end

5. Splitting phase: Compute the mean value of the first column of the matrix M. This mean value is then used as the criterion for the splitting phase. Two clusters are generated from the matrix M after the splitting phase is applied.

6. Stopping criteria checking for each cluster generated in step 5: If the size of the cluster is one, or more generally less than four, then there is no further splitting for this cluster and the procedure stops for it. Otherwise, the procedure is applied again to that cluster.

7. Generating a cluster hierarchy: After all the clusters execute the split/cluster procedure and finish the stopping criteria checking, a cluster hierarchy for all the databases in the network can be created.

3. AN EXAMPLE

In this section, a simple example is used to illustrate the proposed split/cluster procedure. Once the network of databases is partitioned into several clusters, each consisting of one or more databases which have high affinity relationships, the cost of query processing can be reduced.

Example: Suppose there are 10 databases in the network and the historical data consists of 8 queries. Let D = {d_1, d_2, ..., d_10} and Q = {q_1, q_2, ..., q_8}.
Assume the following use_i, where i = 1, ..., 10, and access values are the required primary data obtained from the set of historical data.

use_1 = [1 0 0 1 1 0 0 0];
use_2 = [0 0 0 1 0 0 1 0];
use_3 = [0 1 0 0 1 0 1 1];
use_4 = [0 0 1 0 0 0 0 1];
use_5 = [1 0 0 1 1 1 0 0];
use_6 = [0 0 1 0 0 0 0 1];
use_7 = [1 0 0 1 1 0 0 0];
use_8 = [0 1 0 1 1 0 0 0];
use_9 = [0 1 0 0 1 0 1 1];
use_10 = [0 0 1 0 0 0 0 1];
access = [1 0 3 0 0 3 30];

With the availability of the primary data, the relative affinity values can be calculated. For example, the affinity measure for the entry M(1,5) can be obtained in the following way:

\[ M(1,5) = rel(1,5) = \sum_{k=1}^{8} use_1(q_k) \, use_5(q_k) \, access(q_k) = access(q_1) + access(q_4) + access(q_5) \]

Similarly, all the rel(i,j) entries for M can be computed. The initial affinity measure matrix M is shown in Figure 1.

Figure 1: Initial Affinity Measure Matrix M. Each entry (i,j) in M has the relative affinity value rel(i,j), where i, j = 1 to 10.

As shown in Figure 1, each entry (i,j) in M has the value of rel(i,j), which indicates the relative affinity measure for databases d_i and d_j. In addition, M is symmetric, so the entry (i,j) has the same value as the entry (j,i). For example, M(1,5) and M(5,1) have the same value.

Take the initial affinity measure matrix M and execute the first iteration, i.e., when loop1 = 1. According to the proposed split/cluster procedure, initially the first two columns of M are placed into the temporary matrix O, and column 3 (i.e., d_3) is considered next. There are three possible positions for column 3: to the left of column 1 (computing PP(3,1)), in between column 1 and column 2 (computing PP(1,3) and PP(3,2)), and to the right of column 2 (computing PP(2,3)).

Position 1: to the left of column 1,
    PP(3,1) = M(1,3) - O(1,1) < 0;
Position 2: in between column 1 and column 2,
    PP(1,3) = O(1,1) - M(1,3) > 0;
    PP(3,2) = M(1,3) - O(1,2) < 0;
Position 3: to the right of column 2,
    PP(2,3) = O(1,2) - M(1,3) > 0;

Since positions 1 and 2 each have a negative PP value, these two possibilities are ignored. Therefore, we select to place d_3 to the right of column 2. Similarly, all the other databases are calculated to get an updated matrix. Finally, the rows are permuted to be in the same order as the columns. The same steps are applied for iterations 2 to 9 to get the final affinity measure matrix M (Figure 2), which is then used to illustrate the splitting phase.

Figure 2: Final Affinity Measure Matrix. The dashed lines separate two clusters.

First, the mean of the first column is calculated:

\[ mean = \frac{1}{10} \sum_{i=1}^{10} M(i,1) \]

According to the proposed splitting phase, the mean value is used to consider the splitting of the matrix M. Therefore, two clusters are created: one is in the upper left corner and the other is in the lower right corner (see the dashed line in Figure 2). Let the upper left corner cluster be DB_GROUP_21 = {d_1, d_2, d_5, d_7, d_8} and the lower right corner cluster be DB_GROUP_22 = {d_3, d_4, d_6, d_9, d_10}. Since both clusters contain more than three databases, each cluster needs to execute the split/cluster phase iteratively. Again, the mean value for each cluster needs to be calculated and used as the splitting criterion. Based on the mean values, the clusters DB_GROUP_21 and DB_GROUP_22 are further split into two more
clusters individually (the dashed lines in Figure 3(a) and 3(b)). For each of DB_GROUP_21 and DB_GROUP_22, the mean is the sum of the five entries in the first column of that cluster's matrix divided by 5. Since the numbers of databases in all four resulting clusters are less than four, the procedure stops.

Figure 3: Splitting for the two clusters, panels (a) and (b). Each cluster can be split into two more clusters (as shown by the dashed lines).

After all the clusters execute the split/cluster procedure and the stopping criteria checking, a cluster hierarchy for all the databases in the network can be created. As shown in Figure 4, the cluster DB_GROUP_11, which consists of all the databases in the network, is at the root of the hierarchy. Initially, the split/cluster procedure starts with DB_GROUP_11 and partitions it into two clusters, DB_GROUP_21 and DB_GROUP_22. Then, DB_GROUP_21 can be partitioned into DB_GROUP_31 and DB_GROUP_32, and DB_GROUP_22 can be partitioned into DB_GROUP_33 and DB_GROUP_34. Each finer cluster consists of its own databases. Those databases in the same cluster should be highly affiliated and be accessed together for query information. This cluster hierarchy is then used to decide where a query should be searched for the requested information, to reduce the cost of query processing.

Figure 4: The resulting cluster hierarchy. The lowest level of the hierarchy consists of the individual databases in the network. The root cluster of the hierarchy consists of all the databases. The clusters at each level have their own member databases.

4. CONCLUSIONS

In this paper, we proposed an object-oriented approach to partition a large scale network of databases into a set of clusters. We have formalized a new set of relative affinity measures to represent how frequently two databases have been accessed together by a set of historical queries. Affinity-based measures are both intuitively reasonable and understandable since they consider the access frequencies of queries. We gave a split/cluster procedure for clustering the databases. The split/cluster procedure includes a splitting phase and two stopping criteria, and is executed iteratively. A simple example is run to illustrate the steps of the proposed split/cluster procedure. A cluster hierarchy which provides different levels of abstraction for users to incrementally and dynamically access the information is formed. Since a set of databases belonging to a certain application domain is placed in the same cluster and is required consecutively on some query access path, the number of platter switches for query processing can be reduced. Moreover, the constructed clusters can be used as the unit not only for query processing but also for discovering object-oriented relationships such as superclass, subclass, and equivalence relationships.

References

[1] D.M. Dilts and W. Wu, Using knowledge-based technology to integrate CIM databases, IEEE Trans. Knowledge Data Eng., vol. 3(2), June 1991.
[2] W. Gotthard, P.C. Lockemann, and A. Neufeld, System-guided view integration for object-oriented databases, IEEE Trans. Knowledge Data Eng., vol. 4(1), Feb. 1992.
[3] W. Litwin, L. Mark, and N. Roussopoulos, Interoperability of multiple autonomous databases, ACM Computing Surveys, 22, 1990, pp. 267-293.
[4] J.S. Quarterman and J.C. Hoskins, Notable computer networks, Communications of the ACM, vol. 29(10), 1986, pp. 932-971.
[5] M.P. Reddy, B.E. Prasad, P.G. Reddy, and A. Gupta, A methodology for integration of heterogeneous databases, IEEE Trans. Knowledge Data Eng., vol. 6(6), December 1994.