Project GRACE: A grid based search tool for the global digital library

Frank Scholze (1), Glenn Haya (2), Jens Vigen (3), Petra Prazak (4)

(1) Stuttgart University Library, Postfach 10 49 41, 70043 Stuttgart, Germany; frank.scholze@ub.uni-stuttgart.de
(2) Stockholm University Library, 10691 Stockholm, Sweden; glenn.haya@sub.su.se
(3) CERN Library, CH-1211, Geneva 23, Switzerland; jens.vigen@cern.ch
(4) Stuttgart University Library, Postfach 10 49 41, 70043 Stuttgart, Germany; petra.prazak@ub.uni-stuttgart.de

CERN-OPEN-2004-017, 01/06/2004

Abstract: GRACE (Grid Search and Categorization Engine, http://www.grace-ist.org) is an ongoing EU project. It applies a Grid-based approach to the challenge of searching a global, heterogeneous collection of documents. The goal of the project is to build a distributed search and categorization engine that will run on Enabling Grids for E-science in Europe (EGEE), the successor to the European Data Grid (EDG). This paper describes the project and its potential as a framework for a global ETD search tool.

1 Introduction

The goal of Project GRACE [i] is to build a distributed search and categorization tool adapted to the Grid network infrastructure. We are currently developing the first prototype of the GRACE search engine. Testing and evaluation will proceed throughout the summer and fall of 2004.

One of the unique aspects of the GRACE toolkit is that it is built on a large distributed-computing network referred to as a grid. The project is one of the first to deal with search and retrieval in the grid environment. In the process of creating the search tool we have identified potential advantages of a grid-based search and categorization engine as well as limitations for search and retrieval in the current grid environment.

GRACE is a project in the Fifth Framework Program (FP5) of the Information Society Technologies (IST) initiative of the European Union.
The partners in this project are Telecom Italia Lab (as project leader and manager), CERN (European Organization for Nuclear Research), Virtual Self, Sheffield Hallam University (School of Computing and Management Sciences), Stockholm University Library and Stuttgart University Library. The project started in September 2002 and will end in February 2005.

2 Benefits of GRACE for searching ETDs

GRACE could be used as a framework for a global ETD search tool by searching existing data providers and service providers as well as other content sources. Below are the aspects of the GRACE search and categorization tool with the most potential benefit for the ETD community.

2.1 Federated search of heterogeneous sources

GRACE can integrate content sources that use different protocols such as OAI, http or Z39.50. This means that GRACE can function as a layer on top of service providers such as NDLTD as well as individual repositories.

Content sources can be integrated even if they do not have a search interface of their own. For such sources GRACE provides its own indexing service, which is based on Jakarta's Lucene adapted to the GRACE toolkit. [ii] If, for example, an institution has a collection of PDF files but no way to search through them, it could integrate the collection into GRACE. Users could then search this content source along with any other content source (service providers or data providers) that is integrated with GRACE.

2.2 Federated search of subscription versus free material

Some material, notably ProQuest's Digital Dissertations, can only be accessed through subscription accounts. GRACE is built on top of a Grid network in which users are registered as members of virtual organizations, which can be, for example, a university or a faculty. Virtual organizations can ease the administration of access rights to various sources.
Potentially, this means that a registered user can log in as a member of a virtual organization and GRACE can automatically allow or restrict access to content based on that organization's access levels. This automatic authentication feature will not be included in the first prototype of GRACE but may be added at a later date.
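The access check described in section 2.2 might look roughly like the following Python sketch. It is an illustration only: the class names, the `allowed_vos` field and the example organization and source names are hypothetical and are not part of the GRACE toolkit or of any grid middleware API. The sketch assumes that a source with no listed virtual organizations is free to everyone.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContentSource:
    name: str
    # Hypothetical field: virtual organizations allowed to search this source.
    # An empty set is taken to mean that the source is free to everyone.
    allowed_vos: frozenset = frozenset()


@dataclass
class User:
    name: str
    # Virtual organizations the user is registered with (assumed data model).
    virtual_organizations: set = field(default_factory=set)


def can_access(user: User, source: ContentSource) -> bool:
    """Free sources are open to all; restricted ones need a shared VO."""
    if not source.allowed_vos:
        return True
    return bool(user.virtual_organizations & source.allowed_vos)


def searchable_sources(user: User, sources: list) -> list:
    """Return the sources this user may include in a federated search."""
    return [s for s in sources if can_access(user, s)]


if __name__ == "__main__":
    free = ContentSource("open-repository")
    restricted = ContentSource("subscription-db", frozenset({"example-university"}))
    member = User("member", {"example-university"})
    print([s.name for s in searchable_sources(member, [free, restricted])])
```

Keeping the rule on the source side means a new subscription only has to be recorded once per virtual organization rather than once per user.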

2.3 Sources organized in knowledge domains

Content sources will be organized into knowledge domains (e.g. subjects such as physics or computer science). Users will therefore not have to know every content source relevant to their query. In addition to selecting knowledge domains, they will be able to select the individual content sources that best match their topic.

ETDs, however, are often stored in interdisciplinary sources, which will be included under a general knowledge domain. For this reason, content sources will also be described by predominant document type (such as theses or dissertations), so users can easily identify sources that contain ETDs.

2.4 Automatic categorization

GRACE includes a categorization engine that will dynamically integrate and categorize results from various data sources. This partially solves the problem of integrating results from heterogeneous content sources that rank results using different methods. The categorization engine will be based on linguistic algorithms [iii], as opposed to the statistical methods used in other search and categorization engines such as Vivisimo. To begin with, GRACE will be capable of automatically identifying the language of results and then categorizing them in English, German, Swedish and Italian. Additional languages may be added at a later stage. More details on categorization can be found in section 4.3.

2.5 Use of subject thesauri or classification schemes

To launch a search: GRACE will allow users to search using subject-appropriate thesauri or classification schemes, which change depending on the knowledge domain(s) or sources selected. For example, a researcher searching content sources that focus on particle physics can start by selecting terms from the High Energy Physics Index.
When a search uses a classification scheme that is not supported by a content source, GRACE will take the word or phrase from the classification scheme and perform a keyword search on the documents.

To present results: GRACE will also give users the option to view search results categorized by a specific classification scheme. For example, a user can choose to view search results categorized using terms from the High Energy Physics Index. When this option is chosen, GRACE's categorization engine will automatically categorize the documents using the terminology of the classification scheme, based on a linguistic analysis of the entire text of the documents. This feature is not yet developed but is planned for the GRACE prototype that will be available for public testing in September 2004.

2.6 Multilingual functionality

The search tool will be able to search and automatically categorize documents in various languages with the help of lexical databases. The prototype will support English, German, Italian and Swedish, and the feature is extensible to other languages.

3 Comparison of GRACE to existing tools

Below is a table comparing GRACE to other ETD search tools.

                     NDLTD                 NCSTRL                      GRACE
Content Searched     Repositories via OAI  Repositories via OAI        Various sources via http, OAI, Z39.50
Free or Restricted?  Free                  Free                        Free and restricted content
Index                Centralized           Decentralized               Decentralized
Query Processing     Centralized           Decentralized over the web  Grid computational resources
Response Time        Immediate, Quick      Immediate, Slower           Delayed, Batch Processing

Table 1: Features of GRACE vs NDLTD and NCSTRL

3.1 Content searched

Table 1 shows that GRACE can integrate content sources via OAI, http and Z39.50. This allows it to function as a unified search tool for various content sources, including both ETD service providers and data providers as well as other search interfaces.

3.2 Query processing

GRACE processes search results on the computers that make up the Grid network. This makes the system powerful and scalable without requiring any single institution to invest in a large number of servers.

3.3 Response Time

Finally, Table 1 shows that the response to a GRACE query is delayed, due to the batch orientation of current Grid technology. The current grid architecture does not provide real-time interaction: every job submitted to the Grid carries an overhead of a few minutes, no matter how simple the job is. We therefore decided to create a search and categorization tool that delivers results over time, via links sent by email, rather than providing immediate results.

The GRACE approach is comparable to SDI (selective dissemination of information) or profile searches in online databases like Inspec or Medline. These work as a kind of alerting service that gathers new information into a structure in the background. We believe this to be a potentially powerful paradigm for the future, giving the user control over the structure into which documents or information are fed. A PhD student, for example, could set up a table of contents for a thesis, which would then be enriched by the student's own work as well as by documents retrieved in the background.

4 Workflow

Below is a diagram of the overall workflow of the GRACE tool.

Figure 1: GRACE workflow

4.1 From Query Submission to Downloading

After query formulation and submission (explained in section 5.1), the content sources selected by the user are queried, the results are parsed and the documents are downloaded. This takes place on the internet. The downloaded documents are then sent to the grid network for normalization and categorization.

4.2 Normalization

During text normalization, the text is put into a uniform format in preparation for categorization. For example, stop words such as articles (e.g. "the" and "a") are removed. Words are grouped together when appropriate (for example, proper names and acronyms). Words are also stripped of prefixes and suffixes at this stage. Many of these operations are language specific, and GRACE can perform them for any supported language. At this stage suitable lexical tables are available for English, German and Italian; Swedish will be added soon.

4.3 Categorization

The entire text of each document retrieved by a search is downloaded, normalized (see section 4.2) and sent to the categorization engine, where lexical algorithms are used to categorize the results.

An example of the work done in the categorization stage is a process called disambiguation. A word can mean different things, but clues to the meaning of a word or phrase can often be found in the context in which it is used. The categorization engine analyzes the context of words in order to group together those words that have similar contexts. For example, the word "October" in Figure 2 can refer to the month, the submarine or the Russian Revolution. GRACE's categorization engine uses the context to sort the words into categories and present them to the user.

Figure 2: Disambiguation

4.4 External vs. internal content sources

Figure 1 shows that GRACE will query both external and internal content sources.
An internal content source is a source whose documents have already been parsed and normalized and are stored on the grid, ready to be included in a query and categorized. For example, if a university department stored its theses on a server as PDF files, GRACE could download the documents, process them and store a normalized version of them on the grid. The theses would then be presented to the user as a content source that can be included in the federated search. From the user's perspective, internal sources (normalized and stored on the grid) and external sources (queried through http, for example) are indistinguishable, since the user interface presents both in the same way.
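Sections 4.2 and 4.4 can be illustrated together with a small Python sketch. It is a simplification under loud assumptions: the stop-word list and suffix list are toy English examples standing in for GRACE's language-specific lexical tables, only suffixes (not prefixes) are stripped, word grouping is omitted, and the class and function names (`normalize`, `InternalSource`, `ExternalSource`, `collect`) are hypothetical rather than taken from the GRACE toolkit. The point it shows is that an internal source normalizes once at ingestion, an external source normalizes at query time, and both expose the same interface to the rest of the workflow.

```python
import re
from typing import Callable, Dict, List

# Toy lexical data for illustration only; GRACE relies on lexical tables per language.
STOP_WORDS = {"the", "a", "an", "of", "and", "in"}
SUFFIXES = ("ing", "ed", "es", "s")


def normalize(text: str) -> List[str]:
    """Lowercase, drop stop words and strip one simple suffix per word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    result = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        for suffix in SUFFIXES:
            # Keep a stem of at least three characters.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        result.append(token)
    return result


class InternalSource:
    """Documents are normalized once, at ingestion, and then stored."""

    def __init__(self, name: str, raw_documents: Dict[str, str]):
        self.name = name
        self._store = {doc_id: normalize(text) for doc_id, text in raw_documents.items()}

    def documents(self) -> Dict[str, List[str]]:
        return dict(self._store)


class ExternalSource:
    """Documents are fetched and normalized each time the source is queried."""

    def __init__(self, name: str, fetch: Callable[[], Dict[str, str]]):
        self.name = name
        self._fetch = fetch  # caller-supplied download function (assumption)

    def documents(self) -> Dict[str, List[str]]:
        return {doc_id: normalize(text) for doc_id, text in self._fetch().items()}


def collect(sources) -> Dict[str, List[str]]:
    """Gather normalized documents from any mix of source kinds."""
    merged = {}
    for source in sources:
        for doc_id, tokens in source.documents().items():
            merged[f"{source.name}/{doc_id}"] = tokens
    return merged


if __name__ == "__main__":
    internal = InternalSource("department", {"thesis1": "The grids"})
    external = ExternalSource("web", lambda: {"paper1": "Searching a library"})
    print(collect([internal, external]))
```

Because `collect` only calls `documents()`, the categorization stage would not need to know which kind of source a document came from, which mirrors the uniform presentation in the user interface.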

5 User interface

5.1 Query Input

The query is entered in three stages. First, the user selects the content sources. Either individual sources or entire knowledge domains can be selected. The screenshot of the first GRACE prototype (Figure 3) includes web sources and internal documents. In the prototype available to the public, however, the sources will be divided into knowledge domains such as physics and computer science.

Figure 3: Search Wizard screen #1

On the next screen (Figure 4) the user enters the search term(s). Any term can be typed in, or the user can choose a term from an appropriate classification scheme to launch a search. Searches can also be limited to a specific field. The fields available vary depending on the resources selected.

Figure 4: Search Wizard screen #2

Finally, the user launches the search (Figure 5). If the user is logged in, the e-mail information is filled in automatically. After the search is launched the user receives a confirmation. Once the search has been processed, an email is sent out with a link to the categorized results. The user can get updates on a search as often as once a day.

Figure 5: Search Wizard screen #3

5.2 Results

The results (Figure 6) are sent to the user as a link via email. Results are presented in categories that are displayed on the left-hand side of the screen. Each category is linked to relevant related concepts that are listed at the bottom of the page. From the result page, the user can sort or filter the search results.

Figure 6: Results screen (callouts: automatically created table of contents; automatically created related concepts list, per topic selected in the upper list)

6 Federated Search

Federated search of heterogeneous content sources has traditionally been problematic. Below is a series of problems associated with federated search [iv], each with the solution proposed by GRACE.

Problem: US Digital Library experience suggests cross searching does not scale.
Solution: The user limits source selection by choosing a specific knowledge domain or by choosing sources that focus on a specific document type such as theses.

Problem: Collection description is difficult and users have trouble knowing which sources to search.
Solution: GRACE allows users to choose sources by subject (knowledge domain) or by the primary document type that the source contains.

Problem: Query language and search attributes can vary across different sources.
Solution: Query syntax is mapped to individual information resources. However, as the number of content sources grows there will be considerable maintenance effort.

Problem: Different sources rank results in different ways. Ranking the combined results of various sources is problematic.
Solution: By presenting results divided into categories, GRACE provides an alternative to traditional ranking of results and a partial solution to this problem.

Problem: Performance is limited to the slowest target.
Solution: The project does not provide a direct solution to this problem but instead adopts a batch processing approach that aims to provide high quality results updated over time. The presentation of the search tool clearly explains that the complete results will not be available immediately but that the user will be notified by email when a search is completed.

7 What is Grid Technology?

A Grid is a distributed computer network in which computers share computing power and storage capacity. There are currently several large grid networks, but there is no global grid. The basics of grid computing are explained well at CERN's gridcafe.
[v] As explained on this website, the dream of grid computing is to create a global network of computers, accessible from anywhere, which will function as a practically unlimited computing resource. However, the grids available now and in the immediate future are regional and have e-science as a primary focus.

The concept of a computing grid is often compared to the power grid: the user does not need to worry about which computer processes a request or where the data is stored. Like the power grid, the computing grid would be accessible from anywhere and users would pay for the power they need. There are, however, some key differences between the power grid and the computing grid. For example, power is either on or off; there are no performance issues of the kind that can arise with computer networks such as the Grid. Also, power flows basically only one way, from producer to consumer, which is not true of a grid network, where there is interaction.

Distributed computing, on which grid networks are based, is not a new concept. It developed in an effort to generate processing power for meeting workload challenges. In order to boost processing power, institutions aggregated computing resources across locations or across the entire institution. The idea was to match the supply of processing cycles with the demand created by applications. This concept is now widely practiced by leading organizations around the world. It helps keep computing available despite scheduled maintenance, power outages and unexpected failures.

8 GRACE and the Grid

As mentioned previously, there are several grid networks in use today. GRACE will be integrated with the GILDA grid testbed [vi] (GILDA = Grid INFN Laboratory for Dissemination Activities), which has computing nodes at six sites spread across Italy. GILDA is in turn part of a larger infrastructure called EGEE [vii] (Enabling Grids for E-Science in Europe).

The GRACE project has its own grid nodes, which it has integrated with GILDA. These include 5 CPUs in Turin and 4 CPUs in Milan.

9 Future implications of Grid search and retrieval for the ETD community

In its current stage of development, the grid is well suited to batch processing and to the storage of enormous amounts of data. This makes it appropriate for massive collections of documents and for researchers who are willing to wait for high quality categorization of results. In the future, the process of submitting a job to the grid may be streamlined, making the grid an environment suitable for interactive applications. This demand has been formulated, for example, by the e-learning community utilizing Grid technology [viii], by the GGF working group on Grid information retrieval (GIR) [ix], and by vendors such as IBM, which introduced Masala [x] as an extension to DB2. It is our hope that the lessons learned from the GRACE project will contribute to the grid's development in this direction.

[i] GRACE project, http://www.grace-ist.org/
[ii] http://jakarta.apache.org/lucene/; see also GRACE Deliverable D2.1, GRACE Local Search and Categorization Engine
[iii] Nahum Korda et al.: Unsupervised Taxonomy of Large Document Corpora Utilizing Idiomatic Character of Natural Languages. In: The 2001 International Conference on Artificial Intelligence (IC-AI'2001), June 25-28, 2001, Las Vegas
[iv] The problems were taken from Tutorial 1, CERN Workshop on Innovations in Scholarly Communications, February 2004.
[v] http://gridcafe.web.cern.ch/gridcafe/
[vi] http://gilda.ct.infn.it/main.html
[vii] http://public.eu-egee.org/
[viii] http://www.lege-wg.org/
[ix] http://www.gridir.org/
[x] http://www.ibm.com/software/data/integration/masala.html