
Point-in-Polygon linking with OGSA-DAI

Xiaochi Ma

MSc in High Performance Computing
The University of Edinburgh

Year of Presentation: 2006


Abstract

Distributed data resources can be heterogeneous in their formats, schemas, size, hardware/software implementations and management policies. If people want to access and share these heterogeneous data resources, there should be software that hides the heterogeneity of the resources and exposes their functionality in a standard way. This dissertation presents a procedure for integrating geographic information data on the Grid. The basic procedure and the related performance issues are discussed, and an implementation based on the OGSA-DAI Grid middleware is described.

Contents

1. Introduction
   1.1. Concept of Grid
   1.2. Concept of Web Service
   1.3. OGSA-DAI Grid middleware
   1.4. Point-in-Polygon linking in spatial databases
   1.5. Aims of this project
   1.6. Content of the remaining chapters
2. Overview of OGSA-DAI
   2.1. Document-oriented interface
   2.2. Data Service Resource
   2.3. OGSA-DAI Activity
   2.4. How data flows through activities
   2.5. Synchronous and asynchronous requests
   2.6. Summary
3. Data migration in OGSA-DAI
   3.1. Basic transportation procedure
   3.2. Transportation model
   3.3. Measuring activity performance
   3.4. Experiment aims
   3.5. How to measure performance at the client side
   3.6. Experiment method
   3.7. Experiment results
   3.8. Conclusion
   3.9. Experimental setup
4. Creating the DataBuffer Activity
   4.1. Why create a DataBuffer activity
   4.2. How it improves performance
   4.3. Synchronization objects for DataBuffer
   4.4. Implementation detail
5. Spatial linking with OGSA-DAI
   5.1. Spatial databases working with OGSA-DAI
   5.2. How the linking operation is performed
   5.3. How to improve the efficiency of spatial joins
   5.4. Spatial data migration in OGSA-DAI
   5.5. The procedure of data linking
   5.6. Migrating tables of normal relational data
   5.7. Creating a spatial data column
   5.8. Summary
6. Further work
   6.1. Migrating table schemas between databases
   6.2. Transferring spatial data
   6.3. Creating a data resource accessor for the PostGIS database
7. Conclusion
8. References

1. Introduction

This chapter introduces some basic concepts of the Grid and of Web Services, outlines the motivation for developing the OGSA-DAI middleware, and describes the role this middleware plays within the project. The application scenario and the aims of the project are also discussed, together with an example illustrating the kind of data integration problem that the Grid can solve. The contents of the other chapters are summarized in the last section.

1.1. Concept of Grid

Grid technology offers new opportunities for integrating and sharing large-scale, dynamic, autonomous and geographically distributed computing resources. Compared with legacy distributed computing systems, the participants in a Grid computing environment can be more dynamic and more varied. For example, in a conventional distributed database system, most databases are usually supplied by the same vendor, and the implementation and management of these data resources follow identical principles. In a Grid environment, the resources have various implementations and belong to different organizations and security domains, so a legacy distributed system does not have enough flexibility to share these resources among the participants.

The desired Grid environment can be considered as a pool of resources: users do not need to know the details of each resource, and adding resources to and withdrawing them from the pool is dynamic. In a conventional distributed computing environment, by contrast, the user is aware of the capabilities of each resource and the resource configuration is usually static. Sharing heterogeneous resources in a dynamic way is one goal of Grid technology.

Another goal of the Grid is to expose the functionality of shared resources through a standard interface, so that applications and users on different platforms can all access these resources and use their special functionality. The standards for communication and interaction between a client and a Grid resource should be widely accepted and platform-independent. The Grid should have a mechanism that hides the implementation heterogeneity of a resource from the user and exposes the shared resources through a standard interface that can be accessed from different platforms. In the ideal situation, the service provided by the Grid should be seamless and standardized. To summarize, the motivation of the Grid is to gather different kinds of resources together and share them among different users and applications.

1.2. Concept of Web Service

To achieve the desired functions of the Grid, participants running various software and hardware systems must interoperate well with each other, and solving the problems that arise when implementing a Grid borrows heavily from existing web technology. "Web service is a technology that allows applications to communicate with each other in a platform- and programming language-independent manner" [1]. Because of this interoperability, web services are very suitable for implementing the goals of the Grid concept. "The promise of web services is to enable a distributed environment in which any number of applications, or application components, can interoperate seamlessly among and between organizations in a platform- and language-neutral fashion" [2].

The Grid is the higher-level concept; to implement its goals it needs a stack of protocols to standardize the functionality of the Grid, e.g. exposing resources, accessing resources and invoking special functionality. Web service technology is the lower-level concept that can be used to realize the goals of the Grid, much as Object-Oriented techniques realize software design concepts. Web service technology has the following characteristics:

XML based. XML technology is the core foundation of web services, as it provides platform independence and interoperability. Using the XML standard as the communication format eliminates protocol- or platform-specific differences between the participants in the Grid.

Loosely coupled. "A consumer of a web service is not tied to that web service directly; the web service interface can change over time without compromising the client's ability to interact with the service." [3] In the Grid environment, adding and withdrawing resources should be dynamic, so loose coupling enables dynamic resource configuration in the Grid.

Supports Remote Procedure Calls (RPCs). "Web services allow clients to invoke procedures, functions, and methods on remote objects using an XML-based protocol." [3] This mechanism can be used by a client to invoke functionality of remote resources in the Grid. The operations on a resource can be exposed as a service interface, and the input and output of the execution are transferred between client and service in an XML-based format.

Three technologies support the core functionality of web services:

Simple Object Access Protocol (SOAP). SOAP is an XML-based protocol for exchanging information in a distributed environment. It provides a standard for encapsulating data and transporting documents over lower-level internet protocols, e.g. HTTP or SMTP. By working with SOAP, communication between heterogeneous participants becomes standardized and interpretable.

Web Service Description Language (WSDL). This technology is used to describe the web service interface; WSDL is "an XML grammar for describing a collection of access endpoints capable of exchanging messages in a document-oriented manner" [3].

Universal Description, Discovery, and Integration (UDDI). This provides a registry and directory model for advertising and discovering web services.

Web service technology is not intended to solve the heterogeneity problem of resources itself; rather, it supports a stack of protocols that standardize the interaction between a client and the desired resources. The initial goal of the Grid is to provide a standard interface to heterogeneous resources; the loosely coupled pattern allows a resource to be deployed without notifying clients, and the RPC mechanism allows a distributed resource to be accessed from a local client. Because web services and the Grid share many characteristics and behaviours, the Grid concept is implemented on top of the web service standards.

1.3. OGSA-DAI Grid middleware

With web service technology and the corresponding protocols, we can develop Grid middleware for resource configuration and management. The Open Grid Services Architecture Data Access and Integration (OGSA-DAI) middleware provides such an implementation, which can expose data resources via a web service interface. A data resource here usually means a database system or a plain text file that can store or output data. "OGSA-DAI is a middleware product that allows data resources, such as relational or XML databases, to be accessed via web services. An OGSA-DAI web service allows data to be queried, updated, transformed and delivered. OGSA-DAI web services can be used to provide web services that offer data integration services to clients" [4]. One of the motivations for developing OGSA-DAI is to support the integration of data from various data resources: by interacting with standard web service interfaces, users can integrate datasets that are stored in different kinds of database. Working with the OGSA-DAI client toolkit, users can either extend the OGSA-DAI functionality, e.g.

by developing a new activity supported by a data service resource, or develop client-side software for an OGSA-DAI Grid. In this project, each database is exposed through an OGSA-DAI data service interface, and the data integration procedure is controlled by client software developed with the OGSA-DAI client toolkit.

In a data Grid enabled by OGSA-DAI, the actual data resource is hidden from the user behind a standard web service interface. The client does not interact with the resource directly; instead it communicates with the data service interface. Each data service interface can expose more than one data service resource, each of which is associated with a single data resource. To manipulate a data resource, the client needs to know the address of the web service interface and the ID of the data service resource. The web service address indicates where the perform document is sent, and the specified resource ID tells the service on which data resource the operations are to be performed. After execution, the result is returned to the client in an XML-encoded document.

Figure 1: Interacting with a data resource via an OGSA-DAI data service (Client -> Data Service -> Data Service Resource -> Data Resource)

1.4. Point-in-Polygon linking in spatial databases

In a normal relational database, if we want to perform a join over two tables, both joined tables must have one or more comparable columns. Most of the time the equals symbol, =, can be used in SQL to join tables that share such columns. In a spatial database, however, it would be meaningless to compare the value of a geographic coordinate point with that of a polygon directly. To tell whether or not a point lies inside a polygon, the database itself provides a specific function that computes the distance between the point and the polygon. In a spatial database we can join tables that hold related geographic information: the point-in-polygon linking operation joins tables of which one has a coordinate column and the other a bounding polygon column.

In the application scenario, the tables that contain spatial columns are stored in remote databases. A normal relational database may be unable to perform spatial linking operations, so there are two choices for where the spatial join can be carried out. One is to join the data in local memory, which requires the user to set up a spatial database on their local machine; the other is to join the data in the remote database

where the geographic data resides. If we can invoke the special functionality of the spatial database via an OGSA-DAI data service interface, the problem becomes easy to solve. A further problem of joining data held in different databases is that data has to be moved between them, so it is important to analyze the performance of data transportation in OGSA-DAI and to find an efficient mechanism for it.

1.5. Aims of this project

The aim of this project is to explore the issues relating to the following application scenario. The user is aware of the following public databases:

A database that contains data of interest to the user, e.g. house prices, average salaries, regional education levels and so on. These public data are categorized by geographic zone, and each zone is defined as a polygon given by a set of geographic coordinate points.

Another database that has an intermediate table mapping a postcode to the corresponding coordinate point.

Access to these databases goes through web service interfaces exposed by OGSA-DAI servers. The user has their own database, which is also exposed by OGSA-DAI and contains private datasets with a postcode column.

In the application scenario, the user specifies the linking columns of the joined tables and the required public data; the client application then performs the data integration across the tables and moves the final result back to the user's database. Producing the final dataset requires both ordinary join operations among the normal relational tables and a spatial join, the point-in-polygon linking, in the spatial database. When a linking operation is performed, the participating tables must be stored in the same database; if the tables are stored in different databases, the client must know how to move and join the data efficiently. A simple data integration scenario is presented here: table A contains a public dataset in a spatial database, and the user wants to find the house price and salary for their local region. The user can also create tables and move data into them.

Table A:

  Polygon                       House Price   Salary
  Polygon(PointA, PointB,...)   120,000       25,000
  Polygon(PointC, PointD,...)   140,000       20,000

Table B maps a postcode to a geographic coordinate and is assumed to belong to a different spatial database. This table works as an index: by comparing the value of the Postcode column with the user-specified postcode, we can find the geographic position of the user's postcode. The point-in-polygon linking is performed between this table and table A.

  Postcode   osgridcoord
  EH3 5LW    PointM
  EH5 2PU    PointN

Table C is the user's private dataset, stored in the user's own database. The = operator can be used to join this table with table B, and this linking operation can be carried out in a normal relational database.

  User data           Postcode
  (id, address,...)   EH5 2PU
  (id, address,...)   EH3 5LW

The user wants to obtain the following dataset:

  User data           postcode   house price   salary
  (id, address,...)   EH5 2PU    120,000       25,000
  (id, address,...)   EH3 5LW    140,000       20,000

Most of the postcodes in the user's database will belong to a small area, i.e. to a few polygons, whereas the polygon table in the public database may store geographic information covering a worldwide region and therefore hold a large number of polygon rows. If we can remove the unnecessary polygon data before joining the polygons with the user-specified point dataset, we can improve both the performance of the data transportation and the efficiency of the spatial linking. This project includes an experiment on performing the spatial linking in an efficient way; the details are discussed in chapter 5.
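To make the target of the scenario concrete, the sketch below states, as a single hypothetical SQL statement, the result the client application ultimately has to assemble. It assumes all three tables have already been moved into one spatial database and that a distance function of the kind described in chapter 5 is available; the table and column names are illustrative, and in the real system the join is split across databases and performed stepwise.

    /** Sketch of the query the client application must ultimately evaluate. */
    public class LinkingQuerySketch {
        // Hypothetical table/column names from the scenario above; in the
        // real system the join is split across three databases.
        static final String LINKING_SQL =
            "SELECT c.userdata, c.postcode, a.house_price, a.salary "
          + "FROM tableC c, tableB b, tableA a "
          + "WHERE c.postcode = b.postcode "              // relational join C-B
          + "AND distance(b.osgridcoord, a.polygon) < 1"; // point-in-polygon link B-A

        public static void main(String[] args) {
            System.out.println(LINKING_SQL);
        }
    }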

1.6. Content of the remaining chapters

The organization of the remaining chapters is as follows. Chapter 2 introduces the components of OGSA-DAI, focusing mainly on activities and how they work. Chapter 3 examines the data delivery mechanism of OGSA-DAI and its performance issues; it analyzes the performance of the activities in the chain used for data transportation between services. Chapter 4 introduces the DataBuffer activity and how it improves performance by making the activities on the sink and source sides work in parallel. Chapter 5 presents the procedure for performing spatial linking operations within an OGSA-DAI Grid environment, including an efficient mechanism for the point-in-polygon linking operation. Chapter 6 discusses further work for this project, and chapter 7 summarizes the work done so far.

2. Overview of OGSA-DAI

In this project, all datasets are manipulated via OGSA-DAI-enabled web service interfaces. In the application scenario, users only interact with these data service interfaces; the details of the data manipulation are hidden from the client. To understand how OGSA-DAI works, it is important to understand its working procedure and architecture. This chapter discusses the concepts of data service, data service resource and activity, and presents the working procedure of an activity chain.

The OGSA-DAI middleware is a web service implementation of a data Grid. In an OGSA-DAI Grid, each data resource is exposed via a data service interface, and the intrinsic functionality of the resource is exposed as a web service. "An OGSA-DAI web service allows data to be queried, updated, transformed and delivered. OGSA-DAI web services can be used to provide web services that offer data integration services to clients." [4] The OGSA-DAI middleware architecture comprises several components and interfaces; how the components interact is shown in Figure 2.

Figure 2: Interaction via a document-oriented interface [4]

OGSA-DAI can provide web services following two specifications: Web Services Inter-operability (WSI) and the Web Services Resource Framework (WSRF). In this project, all experiments are implemented with OGSA-DAI (WSI).

2.1. Document-oriented interface

The OGSA-DAI middleware is designed to offer web services to the user for data integration purposes. For interoperability, the service should be accessible from various platforms, so a stack of protocols is needed to regulate the communication between client and service. As a widely accepted format, XML was chosen as the message format for web services. "A Web service is an interface that describes a collection of operations that are network-accessible through standardized XML messaging." [3]

In OGSA-DAI the interaction between a client and a data service interface is achieved by sending a Perform Document (client-to-service) and receiving a Response Document (service-to-client). Both kinds of document are written in XML. The content of a perform document is a chain of OGSA-DAI activities that are pipelined together; the perform document instructs the service which operations to carry out on the data resource. The data service itself does not parse the content of the perform document: on receipt, the document is forwarded to the data service resource, where the activities are executed. The data service interface is the endpoint for the interaction between client and service. When the data service resource finishes executing the activities in the perform document, the result or execution status is written into the response document and sent back to the client. In OGSA-DAI, the communication between client and service is achieved by exchanging documents.

2.2. Data Service Resource

In the OGSA-DAI architecture, the data service is only used for the contract with the client; the data service resource is the point where the user's instructions are understood and carried out. The separation of data service and data service resource hides the server-side implementation details from the client, which helps to support a seamless service. One data service can be associated with several data service resources, so on the client side, to get an instance of the web service, both the address of the data service and the ID of the associated resource must be specified. Inside an OGSA-DAI web service, the perform document is consumed at the data service resource, where its content is parsed and executed. The data service resource interprets the activities into the corresponding data manipulations and wraps the execution results from the database system into the response document. Each data service resource is associated with exactly one data resource.

2.3. OGSA-DAI Activity

In OGSA-DAI, the capabilities of a data service resource are expressed as activities; each activity represents a kind of operation that the data service resource can perform on the data resource. "Activities are the operations that data service resources can perform on behalf of a client. These normally expose an intrinsic capability of the underlying data resource or may be functionality added at the service layer." [4] The activity is the key concept of OGSA-DAI: all of a user's instructions to a service are represented by the activities in the perform document. Usually the activities are pipelined into a chain, which is achieved by connecting the output of one activity to the input of the

next one. When the data service resource executes the activity chain, the data is streamed from the first activity to the last one. An example of an activity chain is illustrated in Figure 3.

Figure 3: Activity chain (DataStore activity -> DataBuffer activity -> Tokenizer activity)

Because an activity can be an additional functionality of the service layer, a user can extend the set of OGSA-DAI activities by adding a specific operation on the service side. The perform document is used by the client to instruct the data service resource which operations to perform on the data resource; all the client-specified activities are written into the XML content of the perform document. Below is an example of a perform document containing an activity chain for a synchronous query; the activities SQLQuery and WebRowSet are included.

    <?xml version="1.0" encoding="UTF-8"?>
    <perform xmlns="http://ogsadai.org.uk/namespaces/2005/10/types">
      <documentation>
        Perform a simple SELECT statement and transform the results
        into WebRowSet XML.
      </documentation>
      <sqlQueryStatement name="statement">
        <expression>select * from littleblackbook where id=10</expression>
        <resultStream name="statementOutput"/>
      </sqlQueryStatement>
      <sqlResultsToXML name="webrowset">
        <resultSet from="statementOutput"/>
        <webRowSet name="webrowsetOutput"/>
      </sqlResultsToXML>
    </perform> [4]

In this perform document, the element sqlQueryStatement represents the SQLQuery activity; it has two sub-elements, expression and resultStream. The content of the expression element is the SQL query statement that is executed on the data resource, and the name attribute of the resultStream element gives the name of the query result output. The from attribute of the resultSet element specifies which stream is used as the input of the WebRowSet activity. Both attributes have the same value, "statementOutput", which means that the output of SQLQuery is connected to the input of WebRowSet.
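On the client side, the same request is normally assembled with the OGSA-DAI client toolkit rather than by writing the XML by hand. The sketch below follows the pattern of the 2006-era WSI client toolkit documentation; the class names and package layout (DataServiceFactory, ActivityRequest, SQLQuery, WebRowSet) should be treated as assumptions, and only the perform() call is directly confirmed by the code shown later in chapter 3.

    import uk.org.ogsadai.client.toolkit.activity.ActivityRequest;     // assumed package layout
    import uk.org.ogsadai.client.toolkit.activity.sql.SQLQuery;        // assumed
    import uk.org.ogsadai.client.toolkit.activity.transform.WebRowSet; // assumed
    import uk.org.ogsadai.client.toolkit.service.DataService;          // assumed
    import uk.org.ogsadai.client.toolkit.service.DataServiceFactory;   // assumed

    public class SyncQuerySketch {
        public static void main(String[] args) throws Exception {
            // Address of the data service and ID of the data service
            // resource (hypothetical values).
            DataService service = DataServiceFactory.getDataService(
                "http://localhost:8080/wsi/services/ogsadai/DataService",
                "MySQLResource");

            // Build the same chain as the perform document above:
            // the SQLQuery output feeds the WebRowSet transformation.
            SQLQuery query = new SQLQuery("select * from littleblackbook where id=10");
            WebRowSet rowset = new WebRowSet(query.getOutput());

            ActivityRequest request = new ActivityRequest();
            request.add(query);
            request.add(rowset);

            // Synchronous execution: perform() blocks until the response
            // document containing the WebRowSet XML comes back.
            service.perform(request);
        }
    }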

2.4. How data flows through activities

When the activity chain is executed by the data service resource, only one activity is being executed at any moment, and an activity does not ask the next one for data until its current execution step is finished. Consider, for example, a request whose chain contains the SQLQuery, WebRowSet and SQLBulkLoad activities. When the data service resource begins to execute the request, the query operation is not carried out first. Instead, the first activity to be executed is SQLBulkLoad: it asks for one block of data from its input port, which is connected to the output of WebRowSet, and then that activity is executed in turn. The query operation does not start until WebRowSet asks SQLQuery for data. The following figure shows how data flows through the activities.

Figure 4: Data flowing through activities (SQLBulkLoad asks WebRowSet for data, which in turn asks SQLQuery)

The data streamed through activities is in units of blocks. "A block can be a Java Object of any type although usually they are Strings or byte arrays." [4] In the example above, a data block of the SQLQuery output is a ResultSet object, while for WebRowSet a single data block is the XML-formatted data of one row of the query result.

2.5. Synchronous and asynchronous requests

An activity request can be executed synchronously or asynchronously; the execution pattern depends on the activities it contains. An activity that has no attached output is executed asynchronously, e.g. DeliverToNull: when this activity is appended to the end of an activity chain, the response document is returned immediately after the data service receives the perform document. The asynchronous request is very useful when the user does not want to wait a long time for the execution result, or when the output is not required.

2.6. Summary

This chapter discussed the basic components and concepts of OGSA-DAI. The data service is a general interface that tells the client where to find the shared data resource. In order to access a certain data resource, the user needs to specify the ID of the data service resource that is associated with the underlying resource. All the operations performed on the data resource are represented as activities of the data service resource; an activity can also be a functionality of the service layer.

3. Data migration in OGSA-DAI

In the application scenario, joining tables from different databases requires duplicating some datasets by moving data from one database to another. Before the join begins, the datasets to be joined must be stored in the same database, and after the join the linked dataset must be transferred to a scratch database for further retrieval. The performance of the data transportation is therefore crucial to the performance of the whole linking procedure, and it is important to understand the performance issues and to find an efficient mechanism for migrating data between OGSA-DAI data services.

OGSA-DAI supports various ways of delivering data. The transportation mechanism used in this project is to connect the output data to a session stream exposed by the data service resource of the source service; the sink service then pulls the data from the stream and inserts it into the target table. The session here is used to store status information for the interaction between multiple activity requests executed on the same data service resource. When the request is executed at the sink data service resource, a new session requirement is specified and the output stream activity creates a session stream. If no session requirement is specified for an activity request, the request joins the implicit session, which is terminated when the execution is over. For the transportation to work correctly, the session stream must exist until all the required data has been delivered to the sink service, so an explicit session is created when the request is processed at the source data service.

3.1. Basic transportation procedure

To make the data flow from the source service to the sink service, the client sends three perform documents. The first one goes to the source service; it contains the activities that query the required data and connect the result to the output stream. The second perform document is executed by the data service resource on the sink side; it tells the sink service where the output stream is and how to handle the delivered data. When the data delivery is finished, the client sends a third perform document to the source service to terminate the session that was created for the transportation.

An example of pulling data between services is illustrated in Figure 5.

Figure 5: Pulling data from the source service (source side: SQLQuery -> WebRowSet -> DTOutput, with a source session; sink side: DeliverFromDT -> SQLBulkLoad -> DeliverToNull)

When the first perform document is sent to the source service, the end of its activity chain is DTOutput, an activity without an attached output, so the activity request is executed in the asynchronous pattern; at this moment the data has not yet been streamed to the output stream. Data is then pulled from the source to the sink service when the SQLBulkLoad activity is executed by the data service resource on the sink side. Each time, the SQLBulkLoad activity asks for one block of data from DeliverFromDT, which retrieves data from the output stream exposed by the data service resource of the source service. Figure 6 illustrates how the bulk load activity retrieves the data from the source service; the bulk load activity has to wait for the completion of the data transportation.

Figure 6: How data is moved across activities (SQLBulkLoad calls processBlock; DeliverFromDT calls getNBlocks on the source's output stream)
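In client code, the three-document protocol of section 3.1 maps onto three perform() calls. The outline below is a minimal sketch under the same assumed toolkit types as the earlier example; only perform() and pollUntilRequestCompleted() are taken from the code shown in section 3.5, and the names of the request and session parameters are illustrative.

    // Imports as in the earlier sketch (assumed toolkit classes).
    public class PullMigrationSketch {
        public static void migrate(DataService source, DataService sink,
                                   ActivityRequest sourceReq,  // SQLQuery + WebRowSet + DTOutput
                                   ActivityRequest sinkReq,    // DeliverFromDT + SQLBulkLoad + DeliverToNull
                                   ActivityRequest endSession, // terminates the explicit session
                                   String sessionId) throws Exception {
            // 1. Asynchronous request: exposes the session output stream at
            //    the source and returns immediately; no data has moved yet.
            source.perform(sourceReq);

            // 2. The sink pulls blocks from the source stream and bulk-loads them.
            sink.perform(sinkReq);

            // 3. Wait until the source side reports completion (poll every
            //    100 ms), then terminate the explicit transportation session.
            source.pollUntilRequestCompleted(sessionId, 100);
            source.perform(endSession);
        }
    }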

3.2. Transportation model

In OGSA-DAI there are two models for data transportation, Block and Full. In the example above, the transportation between services uses the Simple Object Access Protocol, SOAP, as the delivery protocol, and within a SOAP message the data is encapsulated into blocks. In the Block model, the maximum number of blocks per message is a constant: if the number of delivered data blocks exceeds this number, the source service has to construct further SOAP messages to hold the remaining data, which costs some time. In the Full model, all the data is written into one message, so the source service only needs to produce a single SOAP message; but if the number of data blocks is very large, constructing the message can require more memory.

3.3. Measuring activity performance

To analyze the performance of data migration between OGSA-DAI data services, the processing times of different perform documents have to be measured. An activity chain contains several activities, and performance analysis requires knowing the efficiency of each activity in the chain. The processing time of each activity and perform document could be obtained by adding timing code on the server side; in many situations, however, users are not able to add profiling code to servers administered by other people. For such users it is interesting and useful to have a mechanism for measuring the performance, and identifying the bottleneck of an activity request, from the client side.

3.4. Experiment aims

The whole data migration procedure has three stages. The first stage queries the data and converts it into WebRowSet format for delivery and insertion, which requires executing the activities SQLQuery and WebRowSet. The second stage connects the query result to the output stream of the source data service and delivers it to the input stream of the sink service, which is the task of DeliverToDT and DeliverFromDT. The last stage inserts the data into the target table with the SQLBulkLoad activity. These activities deliver the data between data services, and their behaviour directly affects the performance of the data transportation. The experiment measures the performance of each stage of the data transportation; by analyzing the results, we can determine the efficiency of these activities and find the inefficient parts of the transportation procedure.

3.5. How to measure performance at the client side

To get the processing time of a perform document, we can record the system time before sending the perform document and after receiving the response document. In the implementation code it looks like this:

    // Time a synchronous request: perform() returns only after the
    // response document has been received.
    long start = System.currentTimeMillis();
    service.perform(request);
    long stop = System.currentTimeMillis();
    long time = stop - start;

System.currentTimeMillis() records the time before the perform document is sent and after the response document is received; the difference tells us how much time was used to process the activities in the perform document. For an asynchronous request, however, this method is not applicable, because the response document is returned before the execution of the request's activities has finished. The method pollUntilRequestCompleted can be used to wait until the execution of an asynchronous request has finished:

    long start = System.currentTimeMillis();
    sink.perform(sinkReq);
    // Poll the source session every 100 ms until its status is COMPLETED.
    source.pollUntilRequestCompleted(sourceSession.getSessionId(), 100);
    long stop = System.currentTimeMillis();

The polling interval is a parameter of the poll method; in the code above, the execution status at the source service is checked every 100 milliseconds until it is set to COMPLETED. To measure the performance of an asynchronous data transportation request, the poll method can be used to obtain the processing time of the perform document executed on the source side. In the experiment, the client did not check the processing status of the source session every millisecond, because handling the status requests from the client would slow down the data transportation work of the source server; instead, the status of the source session is checked every 100 milliseconds. Doing this introduces some error into the measured data, because the session could finish just after the client sends a status request, in which case up to roughly 100 extra milliseconds appear in the measured session processing time. When a source data service delivers a relatively large amount of data, for example hundreds of rows, to a sink service, the whole processing time is much larger than 100 milliseconds. For instance, when the number of delivered rows is 125, the total

processing time is around 0.4683 s, and the extra 100 milliseconds accounts for only 2.135% of the total time. This error is therefore tolerable (or can be ignored), because the further performance analysis is based on thousands of rows.

After obtaining the processing times of activity requests, the efficiency of each activity can be calculated by comparing their differences. For example, if one request contains activities A, B and C, and another request has only activities A and C, we can determine the processing time of activity B by comparing the times of the two requests. In the data migration procedure, the activity chain can be divided into three parts:

1. Querying the source database and transforming the result into the specified format for transportation and insertion into the sink database; the corresponding activities are SQLQuery and WebRowSet.

2. Serializing and deserializing the data for transportation; the corresponding activities are DeliverToDT and DeliverFromDT.

3. Loading the data into tables on the sink side. The output of SQLBulkLoad is the number of updated rows, which the user usually does not need to know, so DeliverToNull is added to the end of the activity chain to make the request processed asynchronously.

The performance of parts 1 and 2 can simply be measured at the client side by timing the execution of the activity request; but to get the time used for delivering data between data services, the client needs to measure the performance of the asynchronous request executed at the source service by polling it until the session status is completed.

3.6. Experiment method

There are two activity requests:

Request#1: SQLQuery + WebRowSet + DeliverToNull;
Request#2: SQLQuery + WebRowSet + SQLBulkLoad;

By executing the two requests on the same data service resource, we get the processing times of requests #1 and #2, T1 and T2. Since the DeliverToNull activity does not manipulate data, its processing time is much smaller than that of the time-consuming activities SQLQuery and WebRowSet, so the overhead of DeliverToNull can be ignored in request#1. As a result, the difference between the processing times T1 and T2 can be taken as the execution time of the SQLBulkLoad activity.

In the following experiments, the overhead of the DeliverToNull activity is always ignored, as it is insignificant in measuring the execution time of requests. The performance of DeliverToDT and DeliverFromDT can be obtained as follows:

Request#3: SQLQuery + WebRowSet + DTOutput (source side)
           DeliverFromDT + DeliverToNull (sink side)

Executing this request actually needs two perform documents, which are sent to the source and sink services individually. The pollUntilRequestCompleted method is used to measure the processing time of the asynchronous request. Subtracting T1 from the processing time T3 gives the time used for delivering data from the output stream of the source service to the input stream of the sink service. To get stable performance data, every activity request in the experiment is executed 30 times and the result is an average value. The transportation model is set to Block, with 4000 data blocks per message.

3.7. Experiment results

The performance is measured in seconds, and the amount of delivered data is increased from 125 rows to 16000. The following table shows the processing time of each stage of the transportation for different numbers of rows.

  Rows    SQLQuery+WebRowSet   DTOutput+DeliverFromDT   SQLBulkLoad
  125     0.1402s              0.1689s                  0.1592s
  250     0.1412s              0.1820s                  0.1700s
  500     0.1108s              0.3184s                  0.2945s
  1000    0.1135s              0.5672s                  0.4907s
  2000    0.1405s              1.2464s                  0.8536s
  4000    0.0935s              3.0780s                  1.4651s
  8000    0.0975s              6.5223s                  2.9185s
  16000   0.0911s              15.7073s                 6.0635s

Because the results above are averages, the standard deviation was also recorded when collecting the performance data; its value indicates the variation in the result of each execution.

  Rows    SQLQuery+WebRowSet   DTOutput+DeliverFromDT   SQLBulkLoad
  125     0.0369               0.3016                   0.0205
  250     0.0868               0.2903                   0.0163
  500     0.0177               0.4138                   0.1164
  1000    0.0258               0.6614                   0.0230
  2000    0.1049               0.8063                   0.0216
  4000    0.0218               0.4804                   0.0117
  8000    0.0228               0.7044                   0.0894
  16000   0.0161               1.8000                   0.1154

Putting the data into the output stream, delivering it between services and getting it from the input stream on the sink side is the most costly part of the data transportation procedure; as the number of delivered rows increases, the time consumed by this part becomes dominant in the whole processing time. The cost of SQLBulkLoad also increases with the amount of data transported. The proportions of the execution time are illustrated in Figure 7.

Figure 7: The proportion of processing time of the activities (stacked percentages of SQLQuery+WebRowSet, DTOutput+DeliverFromDT and SQLBulkLoad against number of rows)

As the number of delivered rows increases, the time consumed by the serialization and deserialization activities DTOutput and DeliverFromDT becomes dominant: almost 65% of the whole processing time is consumed by these two activities. Another costly activity is SQLBulkLoad, which takes about 33% of the processing time, especially when the number of delivered rows exceeds 4000.

Figure 8: Performance of the activities (processing time in seconds against number of rows for SQLQuery+WebRowSet, DTOutput+DeliverFromDT and SQLBulkLoad)

Figure 8 shows how the amount of delivered data affects the performance of each part of the whole data transportation procedure. As the number of rows increases, putting the data into the output stream exposed by the data service resource of the source service and retrieving the data from that stream into the sink service consumes most of the processing time of the data transportation.

3.8. Conclusion

By comparing the processing times of different activity requests at the client side, a user is able to measure the performance of the data migration procedure and identify the bottleneck activity. According to the performance data, when a large amount of data is delivered, most of the processing time is used by DTOutput, DeliverFromDT and SQLBulkLoad. Two of the most inefficient activities are executed by the data service resource on the sink side, so if we can improve the performance on the sink side, the performance of the whole transportation procedure should improve.

3.9. Experimental setup

The benchmarks presented here were performed using the OGSA-DAI client toolkit, with the server and client software running on the same machine; the hardware and software configuration is described below. The two OGSA-DAI data services both ran on the same machine, and the associated databases all belonged to the same MySQL database server.

  OS          Windows XP
  CPU         Intel (mobile) Pentium III 1.0 GHz
  Memory      512 MB
  Java        Sun J2SE 1.5.0_06
  Database    MySQL 5.0
  Row length  66 bytes
  Row schema  int(11), varchar(64), varchar(128), varchar(20)

Both client and server JVMs were started with the flags -Xms128m -Xmx128m. All experiment results are averages over 30 executions.

4. Creating the DataBuffer Activity

Activities are an extension point of the OGSA-DAI middleware: besides the activities that represent the intrinsic capabilities of the underlying data resource, the other activities are functionalities of the service layer. This chapter introduces a user-created activity, DataBuffer, which is used to improve performance on the sink side during data migration. Its implementation and working principle are discussed, and the resulting performance improvement is described.

4.1. Why create a DataBuffer activity

From the experiment data above we know the processing time of the SQLBulkLoad activity and the time for retrieving data from the output stream of the source service into the sink service. On the sink side, the SQLBulkLoad activity gets its data directly from the output of DeliverFromDT, so in the conventional data transportation procedure the data delivery and the insertion are executed in sequence. However, when we compare the sum of the processing times of the individual parts with the processing time of the whole data transportation procedure, we find that a request containing both the delivery and the insertion activities takes longer than the sum of the times of requests containing only the transportation activity or only the insertion activity. As an example, suppose there are two activity requests:

Request#1: SQLQuery + WebRowSet + DTOutput
           DeliverFromDT + SQLBulkLoad + DeliverToNull;
Request#2: SQLQuery + WebRowSet + DTOutput
           DeliverFromDT + DeliverToNull;

The processing time of request#1, T1, is much larger than that of request#2, because in request#2 the data transportation does not need to stop and wait for the execution of SQLBulkLoad. The yellow curve in the following figure represents the sum of the processing time of request#2 and the individual execution time of the SQLBulkLoad activity. If we can make the delivery and the insertion of data happen at the same time within a single request, the performance of that request could approach the yellow curve.

Figure 9: Performance of the requests and of the SQLBulkLoad activity (processing time in seconds against number of rows for Request#1, Request#2 and Request#2 + SQLBulkLoad)

When the source data service resource has finished executing the DTOutput activity, a new output stream is exposed at the source service; at this moment the real data transportation has not started yet. The delivery actually starts when the sink data service resource begins executing the SQLBulkLoad activity. This activity inserts multiple rows into a relational table, loading one block of data at a time from the previous activity in the chain. Compared with the request that contains a SQLBulkLoad activity, the request that just delivers the data from source to sink without inserting into a table consumes less processing time, because the SQLBulkLoad activity does not load the next data block until it has finished the current insertion. If we insert a buffer activity that lets the transportation and the uploading work in parallel, the performance should improve, because the SQLBulkLoad activity then does not need to wait for the transportation of the next block after finishing the previous insertion. Once the DataBuffer activity is executed by the data service resource, it pulls data from the source service and stores the blocks in a buffer in order of arrival; at the same time it supplies the next activity in the chain, e.g. SQLBulkLoad, with buffered data. The execution of the DataBuffer activity does not stop until there is no more incoming data and no more buffered data.

4.2. How it improves performance

Adding the DataBuffer activity between DeliverFromDT and SQLBulkLoad lowers the transportation time, because it keeps retrieving data from the input stream and moves arrived data on to the activity connected after it. The following activity requests are executed to compare the performance:

Request#1, Source side: SQLQuery + WebRowSet + DTOutput
           Sink side: DeliverFromDT + DeliverToNull

Request#2, Source side: SQLQuery + WebRowSet + DTOutput
           Sink side: DeliverFromDT + SQLBulkLoad + DeliverToNull
Request#3, Source side: SQLQuery + WebRowSet + DTOutput
           Sink side: DeliverFromDT + DataBuffer + SQLBulkLoad + DeliverToNull

Figure 10: Improved performance with the DataBuffer activity (processing time in seconds against number of rows for Request#1, Request#2 and Request#3)

  Rows    Request#1   Request#2   Request#3
  125     0.3091s     0.5711s     0.7548s
  250     0.3232s     0.5147s     0.5425s
  500     0.4292s     0.5569s     0.5962s
  1000    0.6807s     0.8329s     0.8923s
  2000    1.3869s     0.7387s     1.0504s
  4000    3.1715s     6.6112s     3.5624s
  8000    6.6198s     12.5256s    9.8975s
  16000   15.7984s    25.1434s    19.6178s

When the number of delivered rows exceeds the number of blocks per SOAP message, the source service generates another SOAP envelope and sends it to the sink service. When request#2 is executed, data is not pulled from the source side until the sink data service resource finishes executing the SQLBulkLoad activity. The DataBuffer activity, in contrast, can load the rest of the data from the source side while the SQLBulkLoad activity is being executed, so the sink service does not need to stop and wait for the data.

4.3. Synchronization objects for DataBuffer

To make the transportation and the uploading work in parallel, the implementation of the DataBuffer activity has to spawn two or more threads when it executes: one writing data from the activity input into the buffer, and one moving buffered data to the activity output. To achieve the synchronization between the ReadBuffer and WriteBuffer threads, the implementation of the Buffer class must be considered carefully, otherwise deadlock can occur when multiple threads try to access the shared buffer instance at the same time. To make the parallel threads work correctly, the following points must be observed (a sketch of such a buffer class is given at the end of section 4.4):

The same buffer instance must be used to initialize both the read and the write thread.

The read and write operations on the buffer must be synchronized: once one of these methods is executing, none of them returns until the operation is finished.

Adding data to and removing data from the buffer must notify all waiting threads; at any moment only one active thread operates on the buffer, while the others wait for the active one to finish.

The data stored in the buffer must be moved following the first-in, first-out principle. If the buffer is completely full, write operations must block until space becomes available, and read operations can only proceed when the buffer actually contains data.

The life cycle of the WriteBuffer thread is as follows:

Stage 1: check the activity input port; if there are no more incoming data blocks, execute the setCompleted method of the buffer class to notify the other threads working on the buffer, and terminate; otherwise take one data block and go to stage 2.

Stage 2: check whether the buffer is completely full; if not, insert the block at the end of the queue, otherwise wait until the buffer becomes available for writing.

Stage 3: notify the other threads that the buffer state has changed and go to stage 1.

The life cycle of the ReadBuffer thread is as follows:

Stage 1: check whether there is a data block in the buffer or a working WriteBuffer thread; if not, the thread terminates itself, otherwise it takes one block from the head of the queue.

Stage 2: put the data block into the activity output port and go to stage 1.

The manipulation methods on the buffer object must be synchronized between threads; this avoids the read/write races caused by multiple parallel threads. The return value of the setCompleted method of the buffer object is used as a signal to notify the ReadBuffer thread that no more data is coming from the activity input and that the WriteBuffer thread has terminated.

4.4. Implementation detail

Implementing the DataBuffer activity class involves the following stages:

Create the activity implementation class and make it extend the abstract class uk.org.ogsadai.activity.Activity.

Implement the constructor, which takes the corresponding element of the perform document as its input parameter. The constructor initializes the activity instance with information extracted from the element; for the DataBuffer activity this information contains the name of the activity that supplies the input data, the destination of the output data, and the size of the data pool.

Implement the initialise method; "the initialisation stage of an activity's lifecycle is the first point at which the activity context and session can be accessed" [4]. This method is invoked before the processing of data blocks; for the DataBuffer activity, the references to the input and output ports must be set up before data blocks are read from and written to the ports.

Implement the processBlock method; "this is the method where the bulk of an activity's processing is usually performed" [4]. To make the reading and writing of data run in parallel, two Java threads are created during the execution of this method: one thread is in charge of reading data from the input port and writing it into the buffer, while the other's task is to put the buffered data into the output port and empty the corresponding buffer space. The processBlock method is invoked repeatedly until the setCompleted method has been called; to reduce the overhead of repeatedly calling the same method, the main thread executing processBlock should be blocked until the parallel Java threads have finished all their tasks. This way the method is invoked only once.

Implement the subclasses ReadBuffer and WriteBuffer, both of which implement the Runnable interface. Instances of these classes are used to initialize the Java threads created in the execution of the processBlock method. The ReadBuffer thread takes data blocks from the buffer and outputs them, and does not return until there is no more data in the buffer and no working WriteBuffer thread. The task of the WriteBuffer thread is to take data from the activity input port and write it into the buffer; this thread keeps running until no more data comes from the input port, and before it exits it calls setCompleted to inform the main thread that the execution of processBlock is completed.
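The following is a minimal sketch of the bounded FIFO buffer described in section 4.3, written against plain Object monitor primitives. The class name, block type and capacity handling are assumptions for illustration; the project's actual Buffer class may differ.

    import java.util.LinkedList;

    /** Minimal sketch of the bounded FIFO buffer described in section 4.3.
     *  Names and types are illustrative, not the project's actual Buffer class. */
    public class BlockBuffer {
        private final LinkedList<Object> queue = new LinkedList<Object>();
        private final int capacity;
        private boolean completed = false; // set once the writer has finished

        public BlockBuffer(int capacity) { this.capacity = capacity; }

        /** Called by the WriteBuffer thread; blocks while the buffer is full. */
        public synchronized void put(Object block) throws InterruptedException {
            while (queue.size() >= capacity) {
                wait(); // buffer full: wait until a reader frees a slot
            }
            queue.addLast(block);
            notifyAll(); // wake any thread waiting for data
        }

        /** Called by the ReadBuffer thread; blocks while the buffer is empty
         *  and the writer is still running. Returns null when no data is left. */
        public synchronized Object take() throws InterruptedException {
            while (queue.isEmpty() && !completed) {
                wait(); // nothing buffered yet: wait for the writer
            }
            if (queue.isEmpty()) {
                return null; // writer finished and buffer drained
            }
            Object block = queue.removeFirst(); // first in, first out
            notifyAll(); // wake a writer waiting for free space
            return block;
        }

        /** Called by the WriteBuffer thread when the input port is exhausted. */
        public synchronized void setCompleted() {
            completed = true;
            notifyAll(); // let blocked readers observe completion
        }
    }

A ReadBuffer thread built on such a class simply loops on take() until it returns null, which corresponds to stage 1 of its life cycle above.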

5. Spatial linking with OGSA-DAI

This chapter presents the experiment on exposing a spatial database through an OGSA-DAI service and performing the point-in-polygon linking operation. An efficient way of doing point-in-polygon linking is introduced, and the whole procedure of data migration and linking is presented in several steps.

5.1. Spatial databases working with OGSA-DAI

The PostgreSQL database has been tested with the current release of OGSA-DAI; in the experiment, the database used to store geographic information is a PostgreSQL database with the PostGIS extension. "PostGIS adds support for geographic objects to the PostgreSQL object-relational database. In effect, PostGIS 'spatially enables' the PostgreSQL server, allowing it to be used as a backend spatial database for geographic information systems (GIS)" [5]. Because PostgreSQL is a supported data resource of OGSA-DAI, a PostGIS-enabled PostgreSQL database can be exposed by an OGSA-DAI web service, and it is possible to retrieve geographic information through an OGSA-DAI data service interface.

In the application scenario, the user needs to join tables that have related geographic data columns. In a normal relational database, when joining relational tables, the = operator can be used to test whether values are equal; the columns used to perform the join must represent the same attribute in the relations, and their data types must be compatible. In a spatial database, however, the joined columns can store different kinds of geometry objects, and for these the = operator is more naive: it only tests whether the bounding boxes of two geometries are the same [5]. In this experiment we want to perform a spatial linking that needs to know the relationship between a point and a polygon: if the geographic coordinate of the point is located inside the area covered by the polygon, they are related, and the corresponding datasets can be joined together. Comparing the relationship between geometry objects is more complicated than comparing normal SQL data types, so it needs special functionality, and the experiment explores how to join tables containing geographic information with the help of an OGSA-DAI data service.

5.2. How the linking operation is performed

In the experiment, the database used to store the GIS datasets is a PostgreSQL database with the PostGIS extension. The linking operation involves the following tables.

Table espc: this dataset is considered the user's own private table.

  Column name     Data type
  Postcode        varchar(10)
  Address         varchar(300)
  Property_type   char(2)

Table postcode: an intermediate table that maps a postcode to a coordinate.

  Column name   Data type
  Pcd           varchar(7)
  Pcd2          varchar(8)
  Coord         point

Table intermediategeography: this table has a polygon column and a corresponding geographic code for referencing the data in the neighbourhood table.

  Column name   Data type
  Intgeocode    varchar(12)
  Intgename     varchar(50)
  the_geom      geometry

Table neighborhood: the public dataset containing the statistical data of interest to the user; like the intermediategeography table, it has an Intgeocode column.

  Column name   Data type
  Intgeocode    varchar(12)
  House_price   integer
  Salary        integer

The linking between the espc and postcode tables is achieved by a normal relational join. To get the Intgeocode corresponding to a user-specified postcode, a link must be established between the postcode and intermediategeography tables. In the implementation of the experiment, the spatial database provides a geometry relationship function, distance(geometry, geometry), which returns the distance between two geometries. This function can be used to check that a point is within a polygon: e.g. distance(coordA, polygonB) < 1 means that point A is located inside polygon B. To link the postcode and intermediategeography tables, the following query returns the data of the joined table:

    SELECT postcode.*, intermediategeography.*
    FROM postcode, intermediategeography
    WHERE distance(coord, the_geom) < 1;

The linking operation that produces the final dataset is actually divided into several join operations between tables. The spatial join between point and polygon can only be performed by a special geographic relationship function, so data has to be moved from the normal relational database to the spatial database. Most of the time the user's private dataset is focused on a local area, which means the postcodes in the user's table belong to only a few polygons. In that situation, comparing every polygon with each point could be inefficient, because most polygons do not contain any user-specified point.

5.3. How to improve the efficiency of spatial joins

Computing distances is expensive: if the tables containing the points or polygons are large, simply joining them by applying the distance function to each pair of rows can be very slow, because the distance is calculated between every polygon row and every point row, and most of these calculations involve unrelated points and polygons. For example, in the application scenario the public database could contain information on every region of Britain, with the polygon table holding the corresponding geographic information, while the user is only interested in statistics about their local area, e.g. the Edinburgh area. For such a requirement, querying only the geographic data of Scotland is sufficient. The following figure illustrates this example: the polygon table contains geographic information for every region of Britain, but only Scotland contains the coordinates of the user-specified postcodes.
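One standard way to realize this kind of pruning in PostGIS, sketched below, is to pre-filter with the bounding-box overlap operator (&&), which only compares bounding boxes and can use a spatial index, before applying the expensive distance function to the surviving candidates. The extent coordinates are made up for illustration; the pruning mechanism actually used in this project is the subject of the remainder of chapter 5.

    /** Sketch: prune unrelated polygons with a cheap bounding-box test before
     *  the expensive distance() join. The extent below is hypothetical. */
    public class FilteredSpatialJoinSketch {
        static final String FILTERED_JOIN_SQL =
            "SELECT postcode.*, intermediategeography.* "
          + "FROM postcode, intermediategeography "
            // cheap pre-filter: keep only polygons whose bounding box overlaps
            // the area covered by the user's points (hypothetical extent)
          + "WHERE the_geom && GeomFromText('POLYGON((300000 600000, 300000 700000, "
          +     "400000 700000, 400000 600000, 300000 600000))', -1) "
            // exact test only on the surviving candidate pairs
          + "AND distance(coord, the_geom) < 1";

        public static void main(String[] args) {
            System.out.println(FILTERED_JOIN_SQL);
        }
    }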