PROJECT REPORT. Infrastructure for Large-scale Data Resource Sharing & An Environmental Case Study


Blanche Daniel
6 years ago

School of ITEE, The University of Queensland

Abstract. Large-scale data sharing is nontrivial due to the large-scale, autonomous, and heterogeneous nature of data sources. This project represents our initial effort on this problem. The main goal of our project is to identify requirements, system architectures, and key technology barriers to establish an ICT infrastructure to support large-scale data resource sharing between research institutions. To achieve this goal, in the report, we first investigate a few main issues which are important for large-scale data sharing, such as interoperability, extensibility, and scalability, and highlight the need for synergies among several technologies such as Grid, Peer-to-Peer (P2P), and data integration technologies. We then propose a novel service-oriented architecture, which is designed specifically for large-scale data sharing. Finally, we present an environmental case study and report our experiences.

1 Motivation, Scope, and Goals

Modern, large-scale data sharing is typically characterized by the large volumes of data involved and the heterogeneity of the data sources accessed [23]. For example, in order to answer complex biological questions, biologists have to access and analyze large quantities of biological data which are stored over widely distributed repositories. These repositories, each making its own decisions about data storage and retrieval, are highly autonomous and heterogeneous. For instance, they may describe the same data objects using different representations, e.g., protein sequence in SWISS-PROT and protein structure in the Protein Data Bank (PDB). To share data under such circumstances, one possible approach is data replication. That is, data to be shared are first replicated to local repositories or a central repository before any processing (e.g., data mapping, transformation).
Though simple, such an approach suffers from clear limitations such as unnecessary bandwidth cost and high maintenance cost. Moreover, it is sometimes infeasible for privacy reasons. Data integration, on the other hand, avoids these limitations of data replication by allowing the flexible and managed federation, exploration, and processing of data from distributed sources [9]. Over the last decade, various communities have put much effort into data integration. However, supporting data sharing at large scale remains a great challenge.

To gain more insight into the problem, and thus facilitate the development of sophisticated techniques for large-scale data sharing, we focus our study on the Environmental Science area. We chose Environmental Science as our focus area based on the following considerations: (1) it exemplifies this problem with non-trivial yet not overly complex data and models; (2) it is an area with a clear need for national and international collaboration, which has not yet become achievable due to issues many of which modern information and communications technology is well positioned to address; (3) it is itself an important area, and our results can be applied to generate immediate and significant benefits; and (4) the Queensland EPA is a committed partner who provides large and complex real data, spatial models, an operational environment, user requirements, and domain expertise. Our research, however, is not limited to environmental sciences, but aims at supporting all data intensive scientific research.

The main goal of our project is to identify requirements, system architectures, and key technology barriers to establish an ICT infrastructure to support large-scale data resource sharing between research institutions. Specifically, we hope to achieve the following: (1) in-depth knowledge about the important issues involved in large-scale data sharing; (2) in-depth knowledge about key technologies (e.g., their strengths and weaknesses) and their roles in large-scale data sharing; (3) the design of a large-scale, general purpose infrastructure to support data intensive applications; and (4) a working prototype to support data sharing among a selected collection of geospatial data sources (centred around the WildNet database from the EPA).
To achieve the above, in the report, we first investigate a few main issues which are important for large-scale data sharing, such as interoperability, extensibility, and scalability, and highlight the need for synergies among Grid, P2P, and data integration technologies: by combining Grid and data integration technologies, we facilitate interoperability among heterogeneous data sources; by integrating P2P technologies into both Grid and data integration technologies, we improve the extensibility and scalability of data sharing. We then propose a novel service-oriented architecture based on these technologies, which is designed specifically for large-scale data sharing. Finally, we present an environmental case study and report our experiences.

The rest of the report is organized as follows: Section 2 describes the state of the art of large-scale data sharing, including important technologies, their roles, and recent efforts; Section 3 investigates the main issues involved in large-scale data sharing, and highlights the need for technology integration; Section 4 presents the proposed architecture; Section 5 gives an environmental case study; and Section 6 summarizes what we have achieved and points out future work.

2 The State of the Art

In this section, we review the state of the art of large-scale data sharing, including important technologies (i.e., Grid, P2P, and data integration technologies), and recent efforts in integrating these technologies for large-scale data sharing.

2.1 Grid Technologies

Grid technologies and infrastructures aim at supporting coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations [12]. The Open Grid Services Architecture (OGSA) [10] is designed to facilitate interoperability among different Grid deployments; it aligns Grid technologies with Web Services technologies and introduces a service-oriented paradigm into the Grid. The first formal and technical specification of OGSA is the Open Grid Services Infrastructure (OGSI) [32], which has several implementations such as Globus Toolkit 3.0 (GT3) [14]. Currently, OGSI is evolving towards the Web Services Resource Framework (WSRF) [38] to embrace new Web Services standards.

OGSA and OGSI

OGSA adopts a common representation for all resources (e.g., computational and storage resources, programs, databases): each resource in OGSA is represented as a Grid service, i.e., a Web service that provides a set of well-defined interfaces and follows specific conventions [11]. OGSI further specifies the basic interfaces (or porttypes) to be implemented by Grid services, such as GridService (GS), Factory, ServiceGroup, and so on. Most of these interfaces are optional; only GridService is mandatory and must be implemented by all Grid services. Depending on which interfaces are implemented, Grid services with different functionalities result. Grid services can be instantiated dynamically. Each instantiation of a Grid service generates a Grid service instance, which is identified by the Grid Service Handle (GSH) and the Grid Service Reference (GSR). The difference is that the GSH is invariant, while the GSR is stateful and can change over the lifetime of a service instance.
To create a service instance, a Grid service called a factory, which implements the Factory interface, is invoked. Services and factories are located by another Grid service, called a registry, which implements the ServiceGroup (SGR) porttype.

OGSA-DAI/DQP

Both OGSA-DAI and OGSA-DQP build upon OGSA. The main objective of OGSA-DAI/DQP is to provide a uniform service interface for data access and integration over Grids. OGSA-DAI extends Grid services with new services and porttypes for individual data access, such as Grid Data Service (GDS), Grid Data Transport (GDT), Grid Data Service Factory (GDSF), and DAI Service Group Registry (DAISGR). The Grid Data Service is the primary OGSA-DAI service, which supports data access through the GDS porttype (via the perform operation) and data delivery through the GDT porttype. GDS instances are created by invoking a GDSF, which can be located through a DAISGR. OGSA-DQP extends OGSA-DAI with two new services (and their corresponding factories) for distributed query processing over multiple data sources: a Grid Distributed Query Service (GDQS), which compiles, optimises, partitions, and schedules distributed query execution plans over multiple execution nodes in the Grid, and a Grid Query Evaluation Service (GQES), which is in charge of a partition of the query execution plan assigned by a GDQS. In the GDQS, a Grid Distributed Query (GDQ) porttype is added for importing source schemas. OGSA-DQP itself does not do any schema mediation.

2.2 P2P Technologies

P2P technologies share the same final objective as Grid technologies, i.e., to pool large sets of resources. However, they address different requirements and thus follow different design approaches. In general, P2P technologies focus more on decentralization and scalability, while Grid technologies focus more on providing various complex services. Three main classes of P2P systems have emerged so far: distributed computing, file sharing, and collaborative systems, among which file sharing systems are the most studied. Based on whether there is any constraint on network topology or on data placement, file sharing systems are further classified into two main kinds: unstructured (e.g., [15]) and structured (e.g., [30]). Our work is most related to super-peer networks [39], one kind of unstructured network, which strike a balance between the inherent search efficiency of centralized systems and the robustness of decentralized systems. This kind of network can also take advantage of the heterogeneity among the capabilities of participating peers. In a super-peer network, some peers with more capability (e.g., more bandwidth or CPU) take on the role of super-peers, and act as servers to a set of clients (peers with less capability) in the network. A good survey of P2P technologies appears in [24].

2.3 Data Integration Technologies

Data integration technologies have been extensively studied over the last decade. Traditionally, in the database community, data integration systems are characterized by an architecture based on a global schema and a set of sources, and a crucial aspect of these systems is modelling the relation between the sources and the global schema [20]. Two approaches have been proposed: one is global-as-view (GAV) [34], where the global schema is expressed in terms of the sources; the other is local-as-view (LAV) [17], where each source is defined as a view over the global schema.
Regardless of the approach used, during query processing, a query posed over the global schema needs to be reformulated in terms of a set of queries over the sources. A fundamental operation related to modelling is schema matching, where a match operation is a function that takes two schemas as input and returns a match result which includes a set of mapping elements matching elements of one schema to the other schema. [28] gives a survey of approaches to automatic schema matching. However, schemas may have some semantics that affects the matching criteria but is not adequately captured or formally expressed. In such a case, two semantically related schemas may seem unrelated. One solution to this is using ontologies. An ontology is a formal, explicit specification of a shared conceptualization [16]. In particular, ontologies are used to capture some shared understanding of a domain: main concepts and their important relationships. By using ontology-based approaches for data annotation, it becomes easier to achieve semantic integration. A single ontology for all is desirable, but unrealistic. Multiple ontologies may be developed either independently or based on a common upper ontology, and mappings between them need to be established to facilitate interoperability. [26] provides a brief survey of ontology-based approaches, and [19] reviews the state of the art of ontology mapping.
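To make the match operation described above concrete, the following is a minimal sketch of a name-based matcher, assuming each schema is simply a list of attribute names. The function `match_schemas`, the similarity threshold, and the two example schemas are hypothetical and are not taken from any system surveyed here; the sketch compares attribute names only, which is exactly the kind of matcher that misses semantically related but differently named elements.

```python
from difflib import SequenceMatcher


def match_schemas(schema_a, schema_b, threshold=0.6):
    """Return mapping elements (attr_a, attr_b, score) between two schemas.

    Each schema is a list of attribute names. This hypothetical matcher
    uses string similarity of names only; it ignores data types,
    instance data, and any semantics an ontology could supply.
    """
    mappings = []
    for attr_a in schema_a:
        best_attr, best_score = None, 0.0
        for attr_b in schema_b:
            score = SequenceMatcher(None, attr_a.lower(), attr_b.lower()).ratio()
            if score > best_score:
                best_attr, best_score = attr_b, score
        # Keep only the best candidate, and only if it is similar enough.
        if best_attr is not None and best_score >= threshold:
            mappings.append((attr_a, best_attr, round(best_score, 2)))
    return mappings


# Illustrative schemas (assumed, not the actual WildNet or BioMaps schemas).
sightings_schema = ["species_name", "latitude", "longitude", "sighting_date"]
taxonomy_schema = ["SpeciesName", "family", "common_name"]

match_result = match_schemas(sightings_schema, taxonomy_schema)
for attr_a, attr_b, score in match_result:
    print(f"{attr_a} <-> {attr_b} (similarity {score})")
```

A purely syntactic matcher like this pairs `species_name` with `SpeciesName`, but it has no way to relate, say, a scientific name to a common name, which is where ontology-based annotation comes in.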

[Fig. 1. The interoperability issue. Data interoperability: semantics, schema, format. System interoperability: language, platform, remote execution method, protocol.]

2.4 Recent Efforts Towards Large-scale Data Sharing

Recent efforts on data sharing in heterogeneous environments generally seek some kind of synergy among the above technologies. For example, [29] considers integrating P2P and Grid technologies and realizing a service-oriented ad hoc Grid by implementing P2P-based node discovery, property assessment, and service deployment. [18] considers integrating P2P and data integration technologies and extends traditional data integration by introducing a P2P architecture into schema mediation among data sources. Others, such as [31, 37, 36, 1], consider integrating Grid and data integration technologies: [31] describes a system that integrates distributed metadata catalogs on the Grid; [37, 36] describe the Grid Data Mediation Service (GDMS) system, which presents distributed, heterogeneous data sources as one logical virtual data source on the Grid; and [1] proposes a service-based approach to schema federation which is composed of three services: schema translation, schema matching, and schema mapping. The main limitation of these proposals is that they largely ignore the extensibility and scalability issues which are important for large-scale data sharing. Others, such as [3, 5, 4], consider integrating all three technologies for large-scale data sharing, as we do in our work. However, unlike our work, they still depend on centralized mechanisms for service discovery, which limits their scalability.

3 Main Issues

Many issues are involved in large-scale data sharing, such as interoperability, extensibility, scalability, and security. Rather than covering all of them, in this section we focus on a few important issues in order to highlight the need for synergies among Grid, P2P, and data integration technologies.

3.1 Interoperability

In a distributed environment, data sources may present various degrees of heterogeneity across systems and data. In such an environment, one of the most important issues to be addressed is interoperability. We differentiate two kinds of interoperability, system-interoperability and data-interoperability (as shown in Figure 1). System-interoperability deals with system heterogeneities such as differences in system platforms (e.g., Windows, Unix), protocols (e.g., http, ftp), remote execution methods (e.g., Java RMI, CORBA), and programming languages, while data-interoperability deals with data heterogeneities such as differences in data formats (e.g., relational databases, XML, flat files), schemas (e.g., the same objects are described using different structures or terminologies), and semantics (e.g., object semantics may be captured to different degrees, so that domain or expert knowledge is needed to relate two objects). By interoperability, we mean that diverse data sources are integrated not only at the system level (to achieve system-interoperability), but also at the data level (to achieve data-interoperability).

Grid technologies and infrastructures address the interoperability issue only in a limited scope, with more focus on system-interoperability than on data-interoperability. By adopting a uniform service-oriented model, all components of the network are made virtual: resources are represented as Grid services which provide some capability through the exchange of messages using platform- and language-neutral protocols over the network. One promising way to improve interoperability among heterogeneous data sources is to integrate Grid technologies with data integration technologies, which have been extensively studied in various communities. In contrast to Grid technologies, data integration technologies focus more on data-interoperability than on system-interoperability.
In the database community, data integration and exchange between heterogeneous data sources is achieved through a logical mediated schema: mappings are established between the mediated schema and the schemas at the data sources, and queries are posed over the mediated schema and evaluated over the underlying data sources. To match different schemas, various kinds of individual matchers, which can be used alone or together, are employed; they consider schema-level information, instance-level information, or both. Because schemas often express semantics inadequately, domain or expert knowledge must sometimes be used to connect data sources which are seemingly unrelated [21]. In [22], for example, the mediator approach to data integration is augmented with an explicit representation of domain knowledge in the form of one or more ontologies. Techniques developed in the ontology community for semantic integration can be shared and reused in the database community for more automatic schema matching.

3.2 Extensibility and Scalability

Two other important issues to be addressed for large-scale data sharing are extensibility and scalability. Here, extensibility refers to the ability to add or remove data sources with minimal effort, or to easily accommodate changes in data sources, and scalability refers to the ability to adapt to an increasing number of data sources. These two

issues need to be addressed mainly because of the large-scale and autonomous nature of data sources. Clearly, any centralized solution for coordinating large numbers of autonomous data sources is undesirable, as it cannot deal well with the ad hoc sharing of large numbers of data sources, which may appear, disappear, or change contents. Today's Grid technologies are inadequate in addressing these two issues. Though it is possible to improve system scalability through their resource management and scheduling, this improved scalability is compromised by the fact that resources in Grids are managed in either a centralized or a hierarchical manner, which leaves Grid technologies unable to cope and scale well with large numbers of dynamic data sources. Though more interoperability can be achieved, as indicated earlier, by combining Grid and data integration technologies, the extensibility and scalability issues persist, and may even get worse: the mediated schema must be designed carefully and globally before any data sharing, and data sources cannot change significantly or they might violate the mappings to the mediated schema [18]. In other words, similar to Grid technologies, traditional data integration technologies also neglect the ad hoc extensibility and scalability issues which are important for large-scale data sharing. Thus, both Grid technologies and traditional data integration technologies are insufficient for a large-scale data sharing environment. To address this while retaining the benefits of integrating the two technologies (e.g., interoperability), we use P2P technologies and adopt a P2P model for both, which provides benefits that would otherwise be unavailable, i.e., extensibility and scalability.
To combine P2P technologies with Grid technologies, we organize resources (or services) in Grids in a P2P manner for scalable service discovery and deployment: each service (peer) is connected to a set of other services (neighbors); given a request, a service first checks whether it is itself the required service; if not, it forwards the request to its neighbors, and so on, until the request is satisfied. By doing so, we avoid centralized management of resources. Data integration can also be done in a P2P manner. For example, rather than defining a global mediated schema, we can build semantic mappings directly between schemas of different sources [footnote 1]. Each data source corresponds to a peer, which maintains a few mappings with other peers (neighbors). Given a query, a peer first transforms the query based on the mappings maintained locally, then forwards the reformulated query to semantically related neighbors. As such, data are integrated through the collaboration between peers. From the above analysis, we can see that the integration of Grid, P2P, and data integration technologies offers much potential for large-scale data sharing. In the next section, we describe in detail how these three technologies are integrated.

[Footnote 1: We may use ontology-based domain knowledge for building semantic mappings. In case multiple ontologies exist, ontology mappings are built first, also in a P2P manner.]

4 A Service-oriented Architecture for Large-scale Data Sharing

In this section, we present the proposed architecture, which is designed specifically for large-scale data sharing. The architecture, based on the integration of Grid, P2P, and

data integration technologies, is service-oriented, and abstracted into four layers (as shown in Figure 2):

Data layer: this layer consists of a set of autonomous, distributed data sources. Each data source makes its own decisions about the system and data, and great heterogeneities may exist among different sources.

Grid layer: this layer hides the heterogeneities exposed in the data layer, and presents a uniform view (i.e., services) of all resources to the upper layer. All resources are exposed as Grid services, except data resources, which are exposed as Data Access Services (DASs).

P2P layer: this layer organizes services using P2P models to support decentralized service discovery and schema mediation. Note that the distinction between this layer and the Grid layer may not be obvious, and sometimes these two layers may be mixed together (e.g., a Grid service is implemented in a P2P mode).

Application layer: this layer performs data intensive operations, e.g., data analysis possibly spanning multiple data sources.

[Fig. 2. An architecture for large-scale data sharing. Application layer: data analysis, simulation, ...; P2P layer: DAS-S, DAS-M, Registry, Mediation; Grid layer: OGSA-DQP, OGSA-DAI; Data layer: Data Source X, Data Source Y, Data Source Z.]

More details about the architecture are elaborated in the following subsections.

4.1 Data Access Services

We differentiate two kinds of data access services: DAS-S for access to a single data source, and DAS-M for access spanning multiple data sources. Both DAS-S and

DAS-M extend the functionalities of OGSA-DAI and OGSA-DQP. Besides implementing the GDS and GDT porttypes from OGSA-DAI, and the GDQ porttype from OGSA-DQP, DAS-S and DAS-M extend the GridService porttype by adding findneighbors, setneighbors, and findservicedatawithinhops, where findneighbors and setneighbors are used to get and set neighbor services respectively, and findservicedatawithinttl is used to query information about a service within a given Time-To-Live (TTL). In addition, DAS-M introduces two new porttypes to support distributed data integration: Build Mappings (BM) and Query Reformulation (QR). The BM porttype is used for establishing semantic mappings between the mediated schema and the input schemas, which are either mediated schemas themselves or source schemas (schemas of data sources), with the help of Schema Matching Services (SMS) [footnote 2] for schema matching between the input schemas. The QR porttype is used for reformulating a query posed over the mediated schema based on the established mappings [footnote 3]. The functionalities of both DAS-M and SMS are encapsulated in the Mediation module in Figure 2. Among other things, schemas (source schemas or mediated schemas) are exposed as Service Data Elements (SDEs) by DAS-S and DAS-M.

4.2 Service Organization and Discovery

Since all resources in Grids are represented as services, discovering the required services in a large-scale ad hoc Grid is an issue. As mentioned earlier, a centralized discovery mechanism is not desirable. In this part, we describe a decentralized service discovery mechanism. We employ a super-peer network [39] to organize services.
Specifically, each service in Grids corresponds to a peer; for data access, a peer may provide data (through its DAS-S service), or act as a mediator (through its DAS-M service); there are two kinds of peers in the network, super-peers and their clients (often simply called peers); a super-peer acts both as a server to the peers within its group (a peer group consists of a super-peer and its peers), and as an equal to other super-peers within the network. When a peer joins a peer group, it registers some service metadata with its super-peer. To find a service, a peer sends a request (through the findservicedatawithttl operation) to its super-peer. The super-peer searches its service registry and, at the same time, forwards the request to its super-peer neighbors with the TTL decreased by 1. The process is repeated until the TTL is equal to 0. Usually, services which implement the ServiceGroup porttype act as super-peers. Peer groups are formed based on Virtual Organizations (VOs), with one peer group per VO. Alternatively, peer groups could be formed based on the semantic closeness of services for efficient service discovery, that is, services which are semantically close are clustered together. We leave this as future work.

[Footnote 2: As there may be many SMSs, each of which uses different matching methods, SMSs are designed not to be tied to any data access services.]

[Footnote 3: Concept-based queries can also be supported when ontologies are used, which requires additional steps in QR for query transformation.]
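The TTL-bounded discovery of Section 4.2 can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the `SuperPeer` class, the `find_service` function, and the service names are hypothetical and do not correspond to any OGSA or Globus API, and the `visited` set (which stops a request from being processed twice by the same super-peer) is an addition of this sketch that the text does not specify.

```python
class SuperPeer:
    """Hypothetical super-peer: a registry for its group plus neighbor links."""

    def __init__(self, name):
        self.name = name
        self.registry = {}   # service name -> metadata registered by peers
        self.neighbors = []  # other super-peers it treats as equals

    def register(self, service_name, metadata):
        """Called when a peer in this group registers service metadata."""
        self.registry[service_name] = metadata


def find_service(start, service_name, ttl):
    """Search the starting super-peer, then forward hop by hop.

    Each forwarding step decreases the TTL by 1; forwarding stops when
    the TTL reaches 0. Returns a list of (super_peer_name, metadata).
    """
    hits = []
    visited = {start.name}  # assumption: suppress duplicate requests
    frontier = [start]
    while frontier:
        next_frontier = []
        for super_peer in frontier:
            if service_name in super_peer.registry:
                hits.append((super_peer.name, super_peer.registry[service_name]))
            if ttl > 0:
                for neighbor in super_peer.neighbors:
                    if neighbor.name not in visited:
                        visited.add(neighbor.name)
                        next_frontier.append(neighbor)
        frontier = next_frontier
        ttl -= 1
    return hits


# Three super-peers in a chain: sp_a - sp_b - sp_c (illustrative topology).
sp_a, sp_b, sp_c = SuperPeer("A"), SuperPeer("B"), SuperPeer("C")
sp_a.neighbors = [sp_b]
sp_b.neighbors = [sp_a, sp_c]
sp_c.neighbors = [sp_b]
sp_c.register("bird-sightings-das", {"kind": "DAS-S"})

print(find_service(sp_a, "bird-sightings-das", ttl=1))  # not reachable in 1 hop
print(find_service(sp_a, "bird-sightings-das", ttl=2))  # found at super-peer C
```

The example shows the trade-off the TTL controls: a small TTL keeps traffic local to nearby peer groups, while a larger one reaches more registries at the cost of more forwarded requests.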

[Fig. 3. Service interactions during service initialization and set-up (components: Client, Registry, DASMF, DAS-M, SMS; interactions numbered 1 to 6).]

4.3 Service Interaction

In this part, we describe how services interact in two typical scenarios: (1) when a data access service is initialized and set up; (2) when a query is submitted.

Service Initialization and Set-up

Compared to DAS-S, the initialization and set-up of DAS-M is more complex (DAS-S only needs the first few service interactions of DAS-M). Thus, in the following, we focus on DAS-M only. Figure 3 roughly illustrates the service interactions for the initialization and set-up of DAS-M: (1) a DAS-M factory (DASMF) registers itself with its super-peer (registry) when initialized; (2) a client discovers the DASMF for service instance creation by sending the findservicedatawithttl operation to the super-peer; (3) the client creates the DAS-M instance through the createservice operation of the DASMF; (4) the DAS-M imports schemas for mediation through the importschema operation in the GDQ porttype; (5) the SMS imports schemas for matching through the importschema operation; (6) the DAS-M builds mappings via the BM porttype based on the matching result returned from the SMS. Note that whenever a service is initialized, a service interaction like interaction 1 above is invoked, and whenever a service instance is created, service interactions like interactions 2 and 3 above are invoked.

Query Processing

Suppose a query is posed over a mediated schema which involves two data sources.
Figure 4 shows the service interactions during query processing: (1) a query is submitted through the perform operation; (2) the DAS-M reformulates the query via the QR porttype based on the mappings built during the set-up; (3) the reformulated query is passed to the GDQS through the perform operation; (4) the GDQS compiles the query into a distributed query plan, creates a GQES for each partition of the query plan, and passes the corresponding partition to it; (5) each GQES instance starts the

evaluation, and interacts with the GDS; (6) results are propagated back to the client via the GDT porttype.

[Fig. 4. Service interactions during query processing (components: Client, DAS-M, GDQS, GQES, GDS; interactions numbered 1 to 6).]

Query formulation is easy with our architecture, as a client can import the schemas of any data access services within its peer group after requesting service information from its super-peer (as mentioned in Section 4.2, a super-peer maintains some metadata about the services within its peer group, and thus has enough knowledge about these services).

5 An Environmental Case Study

We begin our work with an environmental case study, which is centred around WildNet, an existing research and development program in the Queensland Environmental Protection Agency (EPA).

5.1 WildNet and Datasets

The WildNet database contains 3.5 million records of wildlife sightings and listings of around 20,000 species such as plants, mammals, birds, reptiles, amphibians, freshwater fish, marine cartilaginous fish, and butterflies in Queensland. Species are classified by a taxonomy with multiple levels: kingdom names, class names, family names, scientific names, and common names. The main feature of WildNet is that it maintains a large store of ecological data which depends heavily on many other services and is itself a service to many other applications. One fundamental function required by WildNet is sightings visualization, i.e., sightings can be visualized on the map for a selected area such as a protected area, forestry area, or local government area, or a defined area

bounded by minimum and maximum latitudes and longitudes. Beyond this, it is desirable to model species distribution, and to identify abnormalities or outliers through the analysis of the sightings data together with environmental data such as vegetation data or climate profiles. To realize the above functionalities, we include the following datasets in our case study: snake sightings data in the southeast Queensland region, provided by the Queensland EPA; bird sightings data along the Queensland coastline, provided by the Queensland EPA; bird taxonomy data, extracted from the Australian Museum via BioMaps [2]; and weather data, downloaded from the Australian Bureau of Meteorology.

5.2 The Implementation Architecture

The implementation is based on Open Geospatial Consortium (OGC) [27] standards, as the data involved in our case study have a geographic or spatial nature. OGC is a nonprofit, international, voluntary consensus standards organization that is leading the development of standards for geospatial and location based services. Currently, the OGC has developed a number of web services specifications that enable the interoperability of geospatial data sources in a distributed environment, such as Web Coverage Services (WCS) [8], Web Feature Services (WFS) [35], Web Map Services (WMS) [6], and Catalogue Services (CSW) [25]. Among them, WFS, WMS, and WCS specify the interfaces for access to geospatial data sources, and CSW specifies the interfaces through which other services can be published and discovered. Figure 5 shows the implementation architecture. Due to the time constraints of this one-year project, the current implementation is only OGC-compliant, i.e., data sources are exposed as OGC Web Feature Services. Grid and P2P elements can be included in future development. In the implementation, we use two machines, A and B: machine A stores snake sightings data, bird taxonomy data, and weather data; machine B stores bird sightings data.
A new data source always publishes its services to the catalogue server (step (1)). During a search (step (2)), the catalogue server is first contacted for related data sources (step (3)); then the search request is dispatched to these data sources (step (4)); and finally data are accessed (step (5)).

5.3 Service Development

We chose GeoServer [13] for the server-side WFS server development and uDig [33] for the client-side interface development. The WFS implementation was developed using Java Servlet technology under Eclipse + Tomcat: basic WFS requests (e.g., getcapability, getfeaturedescription, getfeature) are supported through HTTP GET/POST, and results are returned in XML or XML-compliant GML formats, as shown in Figure 6. During the implementation, we found that OGC support is still immature: there is no open source software for catalogue service development. Also, its specification for catalogue services includes many more details than we need for our prototype, and much more effort and time would need to be invested to implement a fully OGC-compliant catalogue

service from scratch. More importantly, because we expect that the Grid registry service can be extended to geospatial data sources in future development, we decided to implement a simple catalogue service that meets our current needs but may not follow the OGC specifications.

Fig. 5. The implementation architecture (components: Client, Mediation, Catalogue Service, and a WFS on each of Machine A and Machine B; arrows numbered (1) to (5))

Fig. 6. The WFS implementation (components: Client, WFS Server, Data Resource Access (Servlet), and GML file; HTTP GET/POST requests and responses)

For the catalogue service development, we first decided on the contents of the metadata database. The design is as follows: each feature type of interest is registered in one table (as shown in Figure 7), and each attribute of each feature type is registered in a separate table. Each feature type has a BBOX attribute, a geometric data type (a Polygon) that defines the geographic bounding box, or boundary, of the feature type. With such a design, we can query the metadata of a feature based on its geographic location or its spatial relationship with other feature types. The URL and Namespace attributes identify where the feature type can be retrieved online.

Next, we built the spatially enabled metadata database. We tried two approaches. The first stores all metadata in an XML-compliant GML file, treats this GML file as a datastore in GeoServer, and uses GeoServer's WFS functions to support spatial queries. This idea comes from the deegree catalogue server [7]. However, after testing on GeoServer, we found the result unsatisfactory because GeoServer did not yet support GML well at the time. The second approach builds a spatial database using PostGIS, a free spatial extension of PostgreSQL that allows spatial records to be built through a user-friendly interface. This approach was successful, and we have created a sample metadata database.
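The metadata design above can be illustrated with a small, self-contained sketch. It is not the prototype's PostGIS database: it uses Python's built-in `sqlite3` module as a stand-in, flattens the BBOX polygon into four numeric columns, and tests overlap with plain comparisons instead of a spatial operator. All table names, column names, URLs, and coordinates are hypothetical.

```python
import sqlite3

# Assumed schema: one row per feature type, one row per feature attribute.
SCHEMA = """
CREATE TABLE feature_type (
    oid       INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    min_x REAL, min_y REAL, max_x REAL, max_y REAL,  -- BBOX, flattened
    abstract  TEXT,
    keyword   TEXT,
    url       TEXT,
    namespace TEXT
);
CREATE TABLE attribute (
    name            TEXT NOT NULL,
    data_type       TEXT,
    abstract        TEXT,
    feature_type_id INTEGER NOT NULL REFERENCES feature_type(oid)
);
"""


def open_catalogue():
    """Create an empty in-memory metadata database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def register(conn, name, bbox, keyword, url, namespace, attributes=()):
    """Register a feature type and its attributes; return its OID."""
    cur = conn.execute(
        "INSERT INTO feature_type "
        "(name, min_x, min_y, max_x, max_y, keyword, url, namespace) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (name, *bbox, keyword, url, namespace),
    )
    oid = cur.lastrowid
    conn.executemany(
        "INSERT INTO attribute (name, data_type, feature_type_id) "
        "VALUES (?, ?, ?)",
        [(attr, dtype, oid) for attr, dtype in attributes],
    )
    return oid


def search_by_keyword(conn, keyword):
    """Return the names of feature types registered under a keyword."""
    rows = conn.execute(
        "SELECT name FROM feature_type WHERE keyword = ? ORDER BY name",
        (keyword,),
    )
    return [row[0] for row in rows]


def search_by_bbox(conn, bbox):
    """Return the names of feature types whose BBOX overlaps the query box."""
    min_x, min_y, max_x, max_y = bbox
    rows = conn.execute(
        "SELECT name FROM feature_type "
        "WHERE min_x <= ? AND max_x >= ? AND min_y <= ? AND max_y >= ? "
        "ORDER BY name",
        (max_x, min_x, max_y, min_y),
    )
    return [row[0] for row in rows]


# Hypothetical registrations; the coordinates are rough placeholders.
catalogue = open_catalogue()
register(catalogue, "snake_sightings", (151.0, -29.0, 154.0, -25.0), "snake",
         "http://machine-a.example.org/wfs", "demo",
         [("species", "string"), ("location", "point")])
register(catalogue, "bird_sightings", (138.0, -29.0, 154.0, -10.0), "bird",
         "http://machine-b.example.org/wfs", "demo",
         [("common_name", "string"), ("location", "point")])
# Non-spatial source: no bounding box, so it never matches a location search.
register(catalogue, "bird_taxonomy", (None, None, None, None), "bird",
         "http://machine-a.example.org/wfs", "demo",
         [("common_name", "string"), ("family", "string")])
```

With these registrations, `search_by_keyword(catalogue, "snake")` returns only the snake source, and `search_by_bbox` returns only the sources whose bounding boxes overlap the query box. In PostGIS the four comparisons would normally be replaced by a spatial predicate on a geometry column.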

FeatureType (OID, Name, BBOX, Abstract, Keyword, URL, Namespace), linked 0..m to Attribute (Name, Data type, Abstract, Feature_Type_ID)
Fig. 7. A simple metadata database for catalogue service

Finally, we implemented the catalogue service in Java. The catalogue server receives a catalogue query from a WFS server, translates the query into SQL to search its metadata database, and then reassembles the result and passes it back to the WFS server.

5.4 The Prototype

The prototype we built has two main functionalities:

- Selective catalogue search. The catalogue service supports two kinds of search: keyword-based search and location-based search. For keyword-based search, given a keyword, only relevant data sources are returned from the catalogue search. Figure 8 shows the keyword input dialogue. When the input keyword is snake, only the snake source is returned, and when the input keyword is bird, only bird data sources are returned, as shown in Figure 9 and Figure 10 respectively. For location-based search, we can specify an area of interest, and only the data sources whose data coverages overlap with the polygon selection are returned, as illustrated in Figure 11.
- Data sharing based on WFS. Figure 12 to Figure 14 show the data sharing functionality of the prototype. For example, given a common name, we can retrieve further information about the birds with that name, such as the taxonomy information and the climate profile, even though this information is distributed across different data sources.

6 Conclusion and Future Work

The main goal of this project is to identify requirements, system architectures, and key technology barriers to establish an ICT infrastructure to support large-scale data resource sharing between research institutions. To achieve this goal, we did the following:

- Examined important issues for large-scale data sharing, such as interoperability, extensibility, and scalability;

Fig. 8. The dialogue for keyword input
Fig. 9. The catalogue search result when input keyword is snake

Fig. 10. The catalogue search result when input keyword is bird
Fig. 11. The catalogue search result when a polygon selection is created

Fig. 12. The dialogue for bird name input
Fig. 13. The distribution of Australian White Ibis

Fig. 14. The climate profile for Australian White Ibis

- Analyzed key technologies like Grid, P2P, and data integration technologies, and pointed out that synergies among them are necessary to address the above issues;
- Proposed a service-oriented architecture for large-scale data sharing, which is based on the integration of Grid, P2P, and data integration technologies;
- Investigated an environmental case study, and built a working prototype to facilitate geospatial data sharing among several data sources centered around the WildNet database from the Queensland EPA.

Besides extending the current prototype with Grid and P2P elements, future work includes in-depth research on data integration topics such as data cleaning, data reconciliation, and automatic construction of semantic mappings, as well as on service composition and workflows.

Acknowledgements

This material is based upon the project supported by the Australian Research Council (ARC) under grant No. SR. We would like to thank Jack Fan Zhang for contributing to some of the development, Dr. David Pullar for the environmental data, and the Environmental Protection Agency (EPA) in Queensland for the sightings data.

References

1. L. A.-Hussaini, S. Viglas, and M. Atkinson. A service-based approach to schema federation of distributed databases. In Edinburgh E-Science Technical Report, EES, 2006.

2. BioMaps.
3. D. Calvanese, G. D. Giacomo, M. Lenzerini, R. Rosati, and G. Vetere. Hyper: A framework for peer-to-peer data integration on grids. In Proceedings of the International Conference on Semantics of a Networked World: Semantics for Grid Databases (ICSNW 2004).
4. C. Comito and D. Talia. Data integration and query reformulation in service-based grids: Architecture and roadmap. In CoreGRID Technical Report, TR-0013.
5. C. Comito, D. Talia, and P. Trunfio. Grid services: principles, implementations and use. In International Journal of Web and Grid Services, Vol. 1, No. 1.
6. J. de La Beaujardiere. OpenGIS Web Map Service (WMS) Implementation Specification, version.
7. Deegree.
8. J. Evans. OpenGIS Web Coverage Service (WCS) Implementation Specification, version.
9. I. Foster and R. L. Grossman. Data integration in a bandwidth-rich world. In Communications of the ACM, Volume 46, Issue 11.
10. I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The physiology of the grid: An open grid services architecture for distributed systems integration. In Technical Report, Globus Project.
11. I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The physiology of the grid: An open grid services architecture for distributed systems integration. In Globus Project.
12. I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: enabling scalable virtual organizations. In International Journal of Supercomputer Applications, 15(3).
13. GeoServer.
14. Globus Toolkit.
15. Gnutella.
16. T. R. Gruber. A translation approach to portable ontology specifications. In Knowledge Acquisition, 5(2).
17. A. Y. Halevy. Answering queries using views: A survey. In The VLDB Journal, Volume 10, Issue 4.
18. A. Y. Halevy, Z. G. Ives, D. Suciu, and I. Tatarinov. Schema mediation for large-scale semantic data sharing. In VLDB Journal.
19. Y. Kalfoglou and M. Schorlemmer. Ontology mapping: the state of the art. In The Knowledge Engineering Review, 18(1):1-31.
20. M. Lenzerini. Data integration: a theoretical perspective. In Proceedings of the 21st ACM Symposium on Principles of Database Systems.
21. B. Ludäscher, A. Gupta, and M. E. Martone. A model-based mediator system for scientific data management. In Z. Lacroix and T. Critchlow, editors, Bioinformatics: Managing Scientific Data. Morgan Kaufmann.
22. B. Ludäscher, K. Lin, S. Bowers, E. Jaeger-Frank, B. Brodaric, and C. Baru. Managing scientific data: From data integration to scientific workflows. In GSA Today, Special Issue on Geoinformatics.
23. D. A. Menasce. Scalable access to scientific data. In IEEE Internet Computing, May/June 2005 (Vol. 9, No. 3).
24. D. S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu. Peer-to-peer computing. In HPL R1.
25. D. Nebert and A. Whiteside. OGC Catalogue Services Specifications, version.
26. N. F. Noy. Semantic integration: a survey of ontology-based approaches. In ACM SIGMOD Record, 33(4):65-70, 2004.

27. Open Geospatial Consortium.
28. E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. In VLDB Journal 10.
29. M. Smith, T. Friese, and B. Freisleben. Towards a service oriented ad-hoc grid. In Proceedings of 3rd International Symposium on Parallel and Distributed Computing.
30. I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of SIGCOMM.
31. R. Tuchinda, S. Thakkar, Y. Gil, and E. Deelman. Artemis: Integrating scientific data on the grid. In AAAI.
32. S. Tuecke, K. Czajkowski, I. Foster, J. Frey, S. Graham, C. Kesselman, T. Maguire, T. Sandholm, D. Snelling, and P. Vanderbilt. Open grid services infrastructure (OGSI) version 1.0. In Global Grid Forum Draft Recommendation.
33. uDig.
34. J. D. Ullman. Information integration using logical views. In Proceedings of the 6th International Conference on Database Theory (ICDT).
35. P. Vretanos. OpenGIS Web Feature Service (WFS) Implementation Specification, version.
36. A. Wöhrer, P. Brezany, and A. M. Tjoa. Virtualization of heterogeneous data sources for grid information systems. In MIPRO.
37. A. Wöhrer, P. Brezany, and A. M. Tjoa. Novel mediator architectures for grid information systems. In Future Generation Computer Systems, 21(1).
38. WSRF.
39. B. Yang and H. Garcia-Molina. Designing a super-peer network. In Proceedings of the 18th International Conference on Data Engineering (ICDE), 2003.

A Grid-enabled Architecture for Geospatial Data Sharing

Yanfeng Shu, Jack Fan Zhang, Xiaofang Zhou
School of ITEE, The University of Queensland
{yshu, jfz, zxf}@itee.uq.edu.au

Abstract

This paper explores