Hypermedia for Information Retrieval. University of Padua - Italy. University of Glasgow - Scotland


Automatic Authoring and Construction of Hypermedia for Information Retrieval

Maristella Agosti, Massimo Melucci
Department of Electronics and Informatics, University of Padua - Italy

Fabio Crestani
Department of Computing Science, University of Glasgow - Scotland

Address to which correspondence should be sent: Maristella Agosti, Department of Electronics and Informatics, University of Padua, Via Gradenigo, 6/a, 35131 Padova, Italy. Voice: +39-49-828 7600 ext. 7650. Fax: +39-49-828 7699. E-mail: agosti@ipdunivx.unipd.it.

Abstract

The paper describes a complete process and a tool for the automatic construction of a multimedia hypertext starting from a large multimedia document collection. Through the use of an authoring methodology, the document collection is automatically authored, and the result is a multimedia hypertext, also called hypermedia, written in HTML, almost a standard among hypermedia mark-up languages. The resulting hypermedia can be browsed and queried using Mosaic, an interface developed in the framework of the World Wide Web project. In particular, the set of methods and techniques used for the automatic construction of the hypermedia is described in the paper, and their relevance in the context of Multimedia Information Retrieval is highlighted.

Keywords: Information Storage and Retrieval, Content Analysis and Indexing, Content Based Retrieval, Hypertext/Hypermedia, Automatic Authoring of Hypermedia.

1 Introduction

In Information Retrieval (IR) systems, the user starts searching for the documents pertinent to his information requirements by entering a query. The system replies by retrieving from a large collection the documents matching the user's query. This querying strategy can be regarded as a batch process, since the user has little opportunity to interact with the system. In hypermedia systems, on the contrary, user-system interaction by browsing is the main feature. People often think that only hypermedia can provide browsing. This is untrue: the ability to move between related topics or documents can also be provided by IR systems supporting relevance feedback [van Rijsbergen, 1979]. Unlike hypermedia, which generally has links statically fixed by an expert user, relevance feedback allows the user to create links dynamically at run time by searching for documents similar to others marked as relevant. However, users will only use browsing if it is easy to do.
Browsing by means of relevance feedback is a very complex process, and most of the existing IR systems supporting relevance feedback do not have a good user interface for browsing. Moreover, though early work on browsing text collections in IR dates back to the seventies [Oddy, 1975], only very few experimental IR systems allow browsing [Thompson, 1989, Frisse, 1988], and only fairly recently has there been a new impulse in this research direction (see for example [Agosti et al., 1989, Dunlop, 1991]). Systems providing both browsing and querying search strategies allow users to access a hypermedia by browsing after a query has been issued. Users are thus also given access to documents that have not matched the query. In particular, given a retrieved document, the user can now pick up the document's neighbours even if they have not matched the query.

This mixed mode of access is especially useful if the collection also contains multimedia documents. Indeed, indexing multimedia documents is rather difficult because of the number of media types and their different natures and representations. For these reasons, multimedia document indexing still requires more methodological and experimental work, whereas textual document indexing has been studied in depth through experiments carried out in several contexts. Some approaches have been proposed to index multimedia document collections: we adopt the approach proposed by Dunlop in [Dunlop, 1991], because it makes it possible to index a non-textual document using its neighbouring documents. Cluster-based techniques are used to relate indexed documents that are neighbours of the multimedia document. In the same way, multimedia portions of a complete document can be indexed and interconnected to construct a hypermedia document. For example, a figure of a document is indexed using the descriptors of its caption, and these descriptors are related to the descriptors of neighbouring clustered text portions of the complete document.

The approach presented in this paper aims at enabling users of large multimedia document collections to browse the document base in a natural way, navigating through connections representing statistical or semantic relationships between multimedia IR (MIR) objects. A MIR object can be a text, figure, term, picture, concept, etc. The approach is content based because it uses, in a coherent way, various IR techniques of content representation for linking MIR objects. It should be noted that these techniques were developed separately and are now connected together in a complete approach for the automatic authoring of a multimedia collection, in order to construct and make available an IR hypermedia.
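The idea of describing a non-textual object through its caption and its textual neighbours can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the procedure of [Dunlop, 1991]: the function name `describe_media_node`, the `min_share` cut-off and the rule "keep a neighbour descriptor if enough neighbours share it" are all hypothetical.

```python
from collections import Counter


def describe_media_node(caption_terms, neighbour_term_lists, min_share=0.5):
    """Return descriptors for a non-textual node (hypothetical helper).

    caption_terms: descriptors taken from the node's own caption.
    neighbour_term_lists: one list of descriptors per neighbouring text node.
    min_share: assumed cut-off; a neighbour descriptor is kept only if it
               occurs in at least this fraction of the neighbours.
    """
    descriptors = set(caption_terms)
    if neighbour_term_lists:
        # Count in how many neighbours each descriptor appears.
        counts = Counter(t for terms in neighbour_term_lists for t in set(terms))
        needed = min_share * len(neighbour_term_lists)
        descriptors |= {t for t, n in counts.items() if n >= needed}
    return sorted(descriptors)
```

With a caption described by `["schema"]` and three neighbouring text portions that all mention `"hypermedia"`, the figure would be described by both terms, while descriptors that appear in only one neighbour would be left out.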
The model presented in [Agosti and Crestani, 1993] provides a conceptual reference for the network structure of the IR hypermedia to be built. An IR hypermedia is a multimedia document base which allows access to multimedia documents mainly by browsing, but which has been authored using IR techniques of content representation and linking. An IR hypermedia is composed of nodes, which are stores of information, and links, which are connections between nodes. The user browses by navigating from node to node using links. The series of navigational choices leads, hopefully, through the document base to the desired information.

The automatic authoring of multimedia documents is made easier by the chosen indexing approach presented in [Dunlop, 1991]. Through this approach, authoring a multimedia document amounts to marking up the neighbouring textual documents. This means that a multimedia node is inserted into the IR hypermedia if one or more of its neighbours are nodes as well. In other words, the descriptors representing a textual document are used for representing content-based links between non-textual documents. It follows that the core of the work is the automatic authoring and construction of an IR hypermedia starting from the document collection of texts. Therefore, the paper concentrates on the presentation of the research results that permit the building of the IR hypermedia.

Manual authoring is feasible if the collection to be authored is not as large as the ones typically managed by an IR system. This is because manual authoring is a time-consuming process, and it is feasible only if the time it demands of an expert user is not excessive. Moreover, authoring "by hand" depends on who is marking up the documents and hence on his subjective criteria. On the contrary, automatic authoring makes it possible to construct a hypermedia from a large collection of multimedia documents without suffering from either the limitations of time or the subjectivity of the expert user. In fact, the methodology we propose is based on well-known and sound IR techniques, and it allows the user to construct a hypermedia that is the result of an unbiased process, because links are fixed according to statistical measures. The presence of dictionaries and thesauri helps the user during query formulation and browsing while he is looking for documents relevant to his information requirements.

Automatic authoring is becoming more and more important as a task within the electronic publishing, information dissemination and retrieval processes, because a great deal of information is contained in journal issues and, in general, in written form. If novel hypermedia techniques such as automatic authoring are used, the user can overcome the traditional linear reading of documents previously available only in textual format on paper. For example, the ACM document collection could be automatically authored; querying and browsing would be easier for the user, who would also have the possibility of using the ACM classification scheme as a content-based representation tool. We feel that, while maintaining the same scientific value, such a document collection could be accessed and browsed from remote sites using non-traditional tools that make retrieval, reading and understanding of the documents' content easier.

2 The Approach for Automatic Authoring and Construction

The starting point of the approach is the usual set of IR raw data: a large flat document collection. Documents are available as individual unrelated objects.
The approach has the following aims. Each aim concerns the setting up of:

- a homogeneous collection of terms, namely the index terms collection
- the concept collection
- the network of links within each collection: documents (D-D links), terms (T-T links), and concepts (C-C links)
- the network of links between a pair of collections: documents - terms (D-T links), terms - concepts (T-C links).

Each aim can be reached in one or more steps. With the exception of the first aim, which must be reached before any others, there is no strict or unique order for reaching the remaining aims. Some aims can also be reached in parallel for a faster construction of the hypertext. The order we use in the presentation follows a simple idea: first determine the different objects, then build up links between homogeneous objects, and finally between objects of different collections. For the presentation of specific methodological details related to this approach, the reader is referred to [Agosti and Crestani, 1993].

1. Construction of the Collection of Index Terms

During this step the index terms are created and connected to documents (D-T links). The collection of index terms is created by extracting terms from documents using an automatic process known as automatic indexing. It is by means of this process that individual terms or groups of terms found in documents become index terms, assuming a representational power that places them on a higher level of abstraction than the documents. Indexing is a very complex process which has been studied for a long time in IR. It constitutes the core of IR research because it is the technique by means of which the informative content of a document is represented and content-based retrieval is made possible. In fact, the obtained description of the document's informative content can then be used to find an answer to the user's information need by means of a matching process with the user's query, which is represented by means of the same indexing process. There are many ways to perform indexing on a set of documents. The most complete way can be divided into: term extraction, stop terms removal, conflation, and weighting. We adopt this complete set of techniques; they are well described in classical IR textbooks [van Rijsbergen, 1979, Salton and McGill, 1983].

2. Links between Concepts using Semantic Relationships

We assume that we are able to identify a set of concepts of the application domain. From an IR point of view, there is no operative advantage in having a set of application domain concepts if they are not connected to each other according to their semantics. It is by looking at the relationships a concept has with other concepts that we can understand the "meaning" of the concept in the context of the application domain. When this meaning has been fully understood, it is also possible to understand the "usage" of the index terms connected to the concept. In fact, they just represent the way the concept has been addressed in the documents belonging to the collection. The way a concept has been addressed by the authors of documents in the collection could differ from the way the user of the IR system is addressing it. Using a very precise term in addressing a concept increases the precision of the retrieval. However, the user could be interested in considering a concept in a looser way. This can be done using index terms expressing concepts semantically related to the concept which is central to the user's information need. In this way it is possible to increase the recall of the retrieval as well.

The utility of having a tool which provides, for each concept, a set of semantically related concepts has long been recognised in IR. A thesaurus is a tool which provides, for each term in a specific application domain, a set of terms related to it by some well-defined semantic relationships. By its nature, the structure of associations represented in a thesaurus can be directly mapped into a network structure: concepts are mapped to nodes (C nodes) and concept relationships to links (C-C links). Sometimes a thesaurus for the specific application domain is not available. In this case it becomes necessary to build up the network of concepts manually. The first essential step is to identify a set of concepts and their relationships. The fundamental types of semantic relationships commonly expressed in a thesaurus are: scope, equivalence, hierarchical and associative relationships (see for example [Srinivasan, 1992]). They provide a useful frame of reference on the kind of relationships to be taken into consideration for the manual construction of a network of concepts.

3. Association between Index Terms and Concepts

The semantic association between index terms and concepts can be built using different formal approaches. The approach described in [Agosti and Marchetti, 1992] and named "semantic association" permits the automatic construction of links between index terms and concepts (T-C links). The cited paper reports a complete description of this technique.

4. Statistically Determined Relationships between Index Terms

There are many techniques for identifying relationships between index terms. For example, using the concept network it is possible to relate index terms by means of objects on a higher level of abstraction. In this work we use a technique for finding relationships between index terms using only information present on the same level of abstraction. This technique does not involve the semantics of index terms but only information provided by statistical analysis of index term occurrence in documents. See [Agosti and Crestani, 1993] for a detailed description of the technique used for the construction of T-T links.
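As a rough illustration of item 4, the sketch below derives T-T links purely from the occurrence of index terms in documents. It is a minimal sketch under assumptions, not the technique of [Agosti and Crestani, 1993]: the cosine coefficient over binary occurrence sets, the threshold, the toy stop list and the crude suffix stripper (a stand-in for a real stemmer such as Porter's) are all choices made here for illustration, and term weighting is omitted.

```python
import math
import re
from collections import Counter, defaultdict

STOP_TERMS = {"the", "of", "a", "and", "in", "for", "is", "to"}  # toy list (assumption)


def conflate(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_document(text):
    """Term extraction, stop-term removal and conflation; returns term counts."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(conflate(w) for w in words if w not in STOP_TERMS)


def build_postings(docs):
    """Map each index term to the set of documents it occurs in (D-T links)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_document(text):
            postings[term].add(doc_id)
    return dict(postings)


def term_similarity(postings, t1, t2):
    """Cosine coefficient between the binary occurrence sets of two terms."""
    d1, d2 = postings.get(t1, set()), postings.get(t2, set())
    if not d1 or not d2:
        return 0.0
    return len(d1 & d2) / math.sqrt(len(d1) * len(d2))


def tt_links(postings, threshold=0.6):
    """Return the term pairs whose similarity reaches the assumed threshold."""
    terms = sorted(postings)
    return {
        (a, b)
        for i, a in enumerate(terms)
        for b in terms[i + 1:]
        if term_similarity(postings, a, b) >= threshold
    }
```

Because the measure uses only co-occurrence in documents, two terms can be linked without any knowledge of their meaning, which is the property the text attributes to statistically determined relationships.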

[Figure 1: A conceptual schema of the IR hypermedia. Labels in the figure: Concept level, Index term level, Document level; Concept (C), Index term (T), Document (D); Auxiliary data; Collection of documents.]

5. Automatic Determination of Relationships between Documents

For the automatic setting up of links between documents (D-D links) it is possible to use statistical techniques very similar to those employed for the construction of links between index terms. Other techniques for setting up a network of related documents make use of bibliographic citations. Bibliographic citations can be used to build up a network by implicitly assuming that the documents cited by a document must be somehow related to it.

Most operational IR systems use only the D-T links in the retrieval process. These are represented in the inverted file structure, which is the most common storage structure in IR. Only very few operational IR systems enable the user to take advantage of relationships like those established by C-C and T-C links, and there they are used only as an aid to query formulation. Relationships like those represented by T-T and D-D links are used only in a few experimental IR systems. A schema of the IR hypermedia produced by the authoring process is depicted in Fig. 1.

3 The Automatic Authoring of an IR Hypermedia

A library of MIR object classes has been developed using C++. The library implements the basic IR structures, and its abstract interfaces allow the user to use IR functionalities. It is important to note that the class library has been developed to be as independent of any specific application as possible, so that a designer can find and re-use basic IR structures and functionalities without having to re-implement them. In line with this independence requirement, the class library includes only classes that are independent of a specific application, so it can be used as a generic IR framework. Using this class library, the automatic authoring of an IR hypermedia is carried out starting from a collection of documents.

The automatic authoring process produces a hypertext in which each MIR object is connected to other ones by means of links. Links connecting MIR objects are set up on the basis of different criteria, such as similarity among documents (D-D links), synonymy and contiguity among index terms (T-T links), pertinence between documents and index terms (D-T links), and semantics between index terms and concepts (T-C links).

An instance of an object of the document class refers to the set of index terms extracted from it and describing its informative content. D-D links are placed on the basis of the measures of similarity among documents. The reference of a document towards another document is an attribute encapsulated by the former. A collection can be represented by defining an ad-hoc sub-class of the document class.

An instance of the auxiliary data class represents an item of auxiliary data, and it is used to represent the semantic content of a set of documents. The abstract interface of the auxiliary data class enables the application designer to access generic auxiliary data without considering the specific criterion by which the auxiliary data has been constructed and associated with the pertinent documents. This means that the auxiliary data class provides an "umbrella" to manage any auxiliary data sub-class specialised from it. This provides the designer with ad-hoc tools to manage specific types of auxiliary data. Our approach provides two types of auxiliary data: index terms and concepts. Because of their distinctive characteristics, we think it is useful to distinguish the sub-classes concept and index term by specialising them from the auxiliary data class, to emphasise the specific features of concepts and index terms with respect to generic auxiliary data.
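The class structure described so far can be pictured with the following sketch. The original library is written in C++ and its actual interfaces are not given in the paper, so this Python outline is only an assumption about how such a hierarchy might look; every class, attribute and method name here is hypothetical.

```python
from abc import ABC, abstractmethod


class AuxiliaryData(ABC):
    """Generic auxiliary data: the "umbrella" for index terms and concepts."""

    def __init__(self, label):
        self.label = label
        self.documents = set()  # references to the pertinent documents

    @abstractmethod
    def kind(self):
        """Name of the specific type of auxiliary data."""


class IndexTerm(AuxiliaryData):
    def __init__(self, label):
        super().__init__(label)
        self.similar_terms = set()  # T-T links
        self.concepts = set()       # T-C links

    def kind(self):
        return "index term"


class Concept(AuxiliaryData):
    def __init__(self, label):
        super().__init__(label)
        self.related = set()  # C-C links

    def kind(self):
        return "concept"


class Document:
    def __init__(self, doc_id):
        self.doc_id = doc_id
        self.index_terms = set()        # D-T links, encapsulated by the document
        self.similar_documents = set()  # D-D links, encapsulated by the document

    def add_term(self, term):
        """Record the D-T link on both the document and the index term."""
        self.index_terms.add(term)
        term.documents.add(self)
```

Code that only needs a label and the pertinent documents can work against `AuxiliaryData` without knowing whether it holds an index term or a concept, which is the point of the abstract interface described above.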
Index terms are auxiliary data which have been automatically extracted from documents through an indexing process. An index term is associated with its frequency of occurrence within the collection; it is also associated with the set of documents from which it has been extracted. T-T and D-T links are placed on the basis of information provided by statistical analysis of the occurrence of index terms and documents respectively. The references of a document towards its extracted index terms, and the references of an index term towards another index term and towards the pertinent documents, are attributes encapsulated by the objects representing documents and index terms.

Concepts are represented through instances of the class concept, which implements the third level of the conceptual architecture. C-C links are set up on the basis of the semantic relationships among concepts. A relationship between two concepts is an entity holding the semantics that has to be represented. A semantic relationship between two concepts is represented through an instance of the relationship class. Since there can be different types of relationship between concepts, it is useful to break the relationship class up into different sub-classes. Thus, the relationship class is specialised into several sub-classes describing the fundamental types of semantic relationships commonly expressed in a semantic structure; these sub-classes are: scope, hierarchical, synonymy and associative. The sub-class hierarchical has been further specialised in the class specialisation to represent the relationship between a concept and the more specific ones.

It has been previously stressed that it is almost always necessary to model "by hand" the semantic relationships between concepts. It is the user himself or a team of domain experts who has to build up the semantic structure which represents important and useful application domain knowledge. However, if this semantic structure is represented and stored in a machine-readable form, the prototype is able to build up the network of concepts automatically. This means that the tool can recognise concepts and relationships among concepts that are coded in a machine-readable form, so that it is no longer necessary to build up the network of concepts manually, and it is also possible to set up the T-C links automatically. We have previously described the semantics of these links: a concept can be linked to several index terms and an index term can be linked to different concepts; for example, the concept "Information Retrieval" is linked to the index terms "Information" and "Retrieval", but the index term "Information" is also linked to the concept "Information Processing". In general, an automatic indexing algorithm considers index terms made of one word, such as "Information" or "Retrieval". The indexing algorithm we developed treats index terms in that way too. However, our class library provides functions to split concepts up into one or more terms; if a split term is an index term, the connection with the concept is set up. In an analogous way, index terms can be concatenated to construct a multi-word term; if the latter is a concept, the T-C link is completed.
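The splitting and concatenation operations just described can be sketched as follows. The function names, the case-insensitive matching and the `max_words` limit are assumptions made for this illustration; the paper does not show the library's real functions, and conflation of the split words is ignored here.

```python
def link_by_splitting(concepts, index_terms):
    """Split each concept into words; link it to every word that is an index term."""
    vocabulary = {t.lower() for t in index_terms}
    links = set()
    for concept in concepts:
        for word in concept.lower().split():
            if word in vocabulary:
                links.add((word, concept))  # one T-C link
    return links


def link_by_concatenation(term_sequence, concepts, max_words=3):
    """Join runs of adjacent index terms; link them if the phrase is a concept."""
    by_phrase = {c.lower(): c for c in concepts}
    terms = [t.lower() for t in term_sequence]
    links = set()
    for start in range(len(terms)):
        # Try phrases of 2..max_words terms beginning at this position.
        for end in range(start + 2, min(start + max_words, len(terms)) + 1):
            phrase = " ".join(terms[start:end])
            if phrase in by_phrase:
                links.update((t, by_phrase[phrase]) for t in terms[start:end])
    return links
```

Applied to the concepts "Information Retrieval" and "Information Processing" with the index terms "information" and "retrieval", splitting links "information" to both concepts and "retrieval" to the first one only, which matches the example given in the text.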
Therefore, the passage from the index term level to the concept level and vice versa, that is, the T-C linking mechanism, is made possible by two operations: splitting a concept up into terms, and concatenating terms to build a concept.

4 The Automatic Authoring Process

The automatic authoring process makes use of the class library presented in the previous Section. The process is depicted in Fig. 2: the input is a flat document collection and the output is an IR hypermedia written in HTML (HyperText Mark-up Language). The IR hypermedia can be browsed and queried using Mosaic (on machines with a graphical interface) or Lynx (on machines able to deal only with text).

It is important to highlight that our approach is general and applicable to several types of collection, as long as they are available in some standard machine-readable form. At present we can handle plain ASCII, LaTeX, BibTeX, BIDS (the standard used by the Institute of Scientific Information Data Service at Bath), and INSPEC. We are currently adding capabilities for translating documents written in other formats into HTML. In addition, we have automatically authored the ACM classification scheme to provide an IR hypermedia with a widespread and wide-ranging concept collection for the computing and computer science domain. This means that the ACM document collections could be automatically authored and queried using the ACM classification scheme itself.

[Figure 2: The automatic authoring process. Diagram labels: flat document collection, documents representation, stop words removal, conflation, indexing, dictionary, weighting, concept collection, concept network construction, HTML document base, automatic authoring, MOSAIC.]

Automatic authoring becomes more important if one takes into account the way machine-readable documents and on-line bibliographies are spreading over the Internet. Documents are physically stored on different Internet sites, but they are interrelated through citation links. If this approach were used, it would make available all those documents connected by means of content-based links.

The authoring process is divided into the following sequence of steps:

1. Collection loading and document representation. The collection is analysed to produce document representations in terms of objects of the class document.

2. Indexing process. The aim of this task is to populate the class index term, which makes up the dictionary of the collection. As words are extracted from documents, they are removed if they are stop words; otherwise they are conflated. Porter's stemming algorithm [Porter, 1980] is used to conflate words to index terms. The index terms extracted from each document item are merged into a single list, which is associated with the document.

3. Semantic structure loading and concept representation. This step corresponds to the second phase of the design process. Like the first task, this one too depends on the particular semantic structure. We have previously outlined how the tool is able to automatically read, represent and manage a semantic structure, provided that it is stored in some standard format and includes the information needed to set up the relationships among concepts. The C-C links are therefore set up during this task.

4. Automatic authoring. This step can be considered the core of the entire process because it makes available to the user an automatically constructed IR hypermedia. It is this task that implements the last three phases of the design process. The computation of the similarity measures and the automatic setting of the D-D, D-T, T-T and T-C links are performed during this step. The operations performed during this task store the items of the three levels of the conceptual schema in a collection of HTML documents.

The HTML documents are linked among themselves using the mechanisms made available by the hypertext mark-up language. An HTML document is linked to another document, or to a part of itself, by means of a pair of tags, say, "link" and "anchor" tags; Mosaic retrieves and displays the anchored document after the user has clicked on the link tag. A document node is authored with link tags on the index terms extracted from its text. Special tags give access to the dictionary and to the classification system. An index term node is authored by linking all the documents from which it is extracted, all the index terms similar to it, and all the related concepts of the classification system.

5 Browsing and Querying an IR Hypermedia

Using Mosaic, the user can access any MIR object of the MIR hypertext by means of two different procedures: browsing and querying. In [Agosti and Crestani, 1993] we stressed and justified the importance of using browsing and querying together to access IR documents. On the network structure of the IR hypermedia it is possible to browse among concepts, index terms, and documents, exploring the large document and auxiliary data space. It is also possible to query the IR hypermedia using the keyword search procedure available through Mosaic. A more complex technique for query processing is under development. In fact, using an IR hypermedia, the process of querying can be enhanced through spreading activation techniques (see, for example, [Salton and Buckley, 1988]).
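To make the idea of spreading activation over a network of documents, index terms and concepts more concrete, the following is a minimal sketch and not the technique under development. The link-type weights, the hop limit, the threshold, the node naming scheme (`doc:`, `term:`, `concept:` prefixes) and the node `doc:D2` are all assumptions introduced for illustration; only the link types (D-D, D-T, T-T, T-C, C-C) and the identifier Abiteboul84 come from the text.

```python
from collections import defaultdict

# Hypothetical weights per link type; no numeric values are given in the text.
LINK_WEIGHT = {"D-D": 0.5, "D-T": 0.6, "T-T": 0.4, "T-C": 0.7, "C-C": 0.3}


def spread_activation(links, seeds, max_hops=2, threshold=0.1):
    """Spread activation from seed nodes over typed, undirected links.

    links: iterable of (node_a, node_b, link_type) triples.
    seeds: dict mapping the nodes chosen by the user to initial activation.
    Returns a dict mapping every reached node to its accumulated activation.
    """
    neighbours = defaultdict(list)
    for a, b, kind in links:
        neighbours[a].append((b, LINK_WEIGHT[kind]))
        neighbours[b].append((a, LINK_WEIGHT[kind]))

    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_hops):                 # constraint: limited distance
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for other, weight in neighbours[node]:
                passed = energy * weight
                if passed >= threshold:       # constraint: drop weak signals
                    next_frontier[other] += passed
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation


def rank_documents(activation, seeds):
    """Return non-seed document nodes sorted by decreasing activation."""
    docs = [(n, a) for n, a in activation.items()
            if n.startswith("doc:") and n not in seeds]
    return sorted(docs, key=lambda pair: (-pair[1], pair[0]))


# Tiny illustrative network (hypothetical apart from the link types).
links = [
    ("term:object", "doc:Abiteboul84", "D-T"),
    ("term:object", "term:database", "T-T"),
    ("term:database", "doc:D2", "D-T"),
    ("term:database", "concept:database management", "T-C"),
]
seeds = {"term:object": 1.0}
activation = spread_activation(links, seeds)
ranking = rank_documents(activation, seeds)
```

In this sketch the document reached directly from the seed term ranks above the one reached through a similar term, because each traversed link attenuates the activation. Other constraint policies (for example, forbidding certain link-type sequences) would fit the same structure.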
Once the user has entered the network structure of the IR hypermedia through a concept, an index term, or a document, he can go on building up a query by browsing over other concepts, index terms, or documents and including in the query those that he thinks are relevant to his information need. After the user has built up a query by browsing, an automatic procedure can be activated. This procedure makes use of the different semantics associated with link and node types to spread activation over the network, and so brings in concepts, index terms or documents that are closely related to those indicated by the user in the query. The user can provide feedback to the system by marking the nodes that he considers relevant in the retrieved list. In this way the user assesses whether or not the spreading has been successful in adding new MIR objects to his query.

This process is similar to the relevance feedback technique used in advanced IR systems. In those systems, relevance feedback is used to modify the query terms according to the suggestions the user gives back to the system after he has marked the relevant documents. New query terms are determined by the system on the basis of the weights of the previous query terms. Our approach provides query modification based not only on statistical analysis (D-T and T-T links), but also on semantic relationships (T-C and C-C links). After a new query has been formulated, the user can start a new spreading activation process and continue his search in an iterative and interactive process controlled (constrained) by the system.

6 Initial Experimental Results of Constructing and Browsing an IR Hypermedia

We have developed a tool for the automatic construction of an IR hypermedia which makes use of the class library presented in the previous Section. We have called it TACHIR, which stands for Tool for the Automatic Construction of Hypermedia for Information Retrieval. TACHIR can be activated from inside a Mosaic session by clicking on the dedicated button of the home page. Fig. 3 shows the "Automatic Construction of an IR hypermedia" button, which activates this function. The user is asked to indicate the location of a collection of documents and of a collection of concepts, such as a thesaurus or the ACM classification scheme, if one exists. TACHIR automatically builds the corresponding IR hypermedia, which can be browsed and queried using Mosaic. If one or more IR hypermedia are already available, the user can instead pick the "Browsing and Querying of an IR Hypermedia" button to browse or query the chosen IR hypermedia with the functionalities illustrated in Section 5.

[Figure 3: TACHIR home page]

At present, only document collections in the BIDS, LaTeX, BibTeX and plain ASCII formats have been used to automatically generate an IR hypermedia. At the current stage, we have tested the prototype on a fairly large BibTeX bibliographic reference collection. It is important to highlight that our approach is general and applicable to several collections, as long as they are available in a machine-readable form. We are implementing new TACHIR functionalities to translate other types of collections into an IR hypermedia. Some collections are made up of documents that include references to figures and bibliographic references in addition to their usual structured full text: for example, LaTeX documents. Tools for translating LaTeX documents into HTML documents are now available, but they lack automatic authoring.

In the following, a sort of guided tour of the construction of the IR hypermedia of a BibTeX collection is presented, to explain the complete approach with a real case study. Of course, this tour cannot be exhaustive, but it is representative of the possibilities available to the users of TACHIR.

We have taken a BibTeX collection as raw input data, since a BibTeX collection can include abstracts, which are full-text documents. A BibTeX item is a bibliographic record including different kinds of entry: keywords and abstract, in addition to the usual data fields such as title, authors, affiliation, and so forth. In the following, a BibTeX collection of 18,000 entries on object-orientation is used. As previously pointed out, we have also chosen the ACM classification scheme as the semantic structure placed at the third level of the architecture, since it is one of the most widespread semantic structures in the computing and computer science domain. The ACM classification scheme entries are hierarchically organised: each entry is a concept that can hold one or more narrower concepts and has a broader concept.
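The hierarchical organisation of concepts just described can be pictured with a small data structure. The sketch below is only an illustration under stated assumptions: the class name `Concept`, its methods, and the two labels used in the example are hypothetical and are not taken from the TACHIR class library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison avoids recursing through parent/child links
class Concept:
    """A node of a hierarchical classification (hypothetical sketch)."""
    label: str
    broader: Optional["Concept"] = None            # at most one broader concept
    narrower: List["Concept"] = field(default_factory=list)

    def add_narrower(self, label: str) -> "Concept":
        """Create a narrower concept and link it in both directions (C-C links)."""
        child = Concept(label, broader=self)
        self.narrower.append(child)
        return child

    def path_to_root(self) -> List[str]:
        """Labels from this concept up to the top of the hierarchy."""
        node, path = self, []
        while node is not None:
            path.append(node.label)
            node = node.broader
        return path


# Illustrative labels only.
root = Concept("information systems")
child = root.add_narrower("database management")
```

Following `broader` repeatedly gives the context of a concept, while `narrower` gives the refinements a user could browse to next.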
Each entry of the ACM classification that can be picked by the user is underlined by the prototype. The user can select a whole entry or a part of it; for example, given a document title, it is possible to retrieve the entire document or the information associated with a title term, depending on whether the user clicks on the whole title or on a title term. However, it must be noted that these characteristics are typical of other collections as well.

Once the user has selected the document collection of interest (the BibTeX collection in this guided tour), he chooses the starting point of the browsing among the three levels of the architecture depicted in Fig. 1:

1. the collection of documents,
2. the set of index terms that are automatically extracted from the documents,
3. the set of concepts that take part in the ACM classification scheme.

Let us suppose that the user has chosen the second level, namely the dictionary term level. After the user has chosen the index term object, a page containing links to the related information is displayed. The information related to an index term is represented by three buttons linking similar terms, pertinent documents and related concepts.

[Figure 4: An index term and its related information]

Each of these sets of information represents a "direction" along which an index term spreads its semantics: two directions are vertical, towards the higher and lower levels, and the third is horizontal, along the level where the term is placed. Clicking one of these "directions" lets the user get the concepts explaining the index term semantics, or the documents whose semantics is explained by the index term. It should be noted that this is only one of the possible ways of starting: another strategy is based on querying, but here we are mainly interested in browsing, since we are addressing content-based hypermedia functionalities.

When the user is looking at an index term he can pick an entry among different ones; for example, the user might be interested in looking up the documents that are pertinent to the term object. Index terms have been associated with a triplet of sets: the sets of similar terms, pertinent documents and related concepts. Let us suppose that the user clicks on the anchor of the pertinent documents. In Fig. 5, the list of documents pertinent to object is presented. The user picks a specific document from the list: the document identified by Abiteboul84 is picked because he infers that it could be of interest to him. After being selected, it is presented to the user (Fig. 6). Only after reading through the selected document can the user decide whether it is actually relevant to his information requirements.

[Figure 5: The list of documents pertinent to object]

[Figure 6: A document pertinent to object]

Sometimes a user is not able to find the relevant documents after the first selection. In that case, he has to reformulate the query by clicking the button of terms extracted from the selected document, by going back to the term level, or by reading through the list of similar documents. After having chosen the list of extracted terms, the user selects the term database as a kind of query reformulation. Picking the term database among the extracted terms appearing in Fig. 7, the user could collect the documents pertinent to it; from these documents it is possible to choose another document of the list and display it in a page.

[Figure 7: Terms extracted from a document]

However, the term database is rather general and it is not semantically meaningful to the user. The user may then wish to access the third level by looking for the concepts related to that term. The concept database management is useful to clarify to the user the possible contexts in which the term database is used. Fig. 8 displays this concept; the underlined terms are those output by the indexing process.

ACM classification scheme entries are alphanumeric strings representing concepts. These concepts are organised hierarchically according to a narrower-broader relationship. Such an entry is made of one or more terms, and some of these terms can be index terms and thus belong to the second level of the architecture. Accessing concepts through an index term allows the user to see the concepts related to it. The index term-concept (T-C) association rule is used during the automatic authoring process: each index term is connected to the concepts containing it, and each concept is connected to its component index terms. This rule is based on a quite straightforward mechanism: when the user is browsing the IR hypermedia at the second level, he can retrieve the concepts one of whose component words equals the selected index term, together with the available narrower and broader concepts; symmetrically, the user can ask the system to retrieve the index terms forming the concept he is considering while browsing the third level.
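The T-C association rule described above can be sketched in a few lines. This is a simplified illustration, not the TACHIR implementation: the function names, the tokenisation by alphanumeric runs, and the example labels are assumptions. In particular, the `conflate` parameter defaults to the identity function here, whereas in the process described in Section 4 index terms are produced by Porter's stemming algorithm, so a real setup would have to conflate concept words in the same way before matching.

```python
import re
from collections import defaultdict


def split_concept(label):
    """Split a concept label into its lower-cased component words."""
    return re.findall(r"[a-z0-9]+", label.lower())


def build_tc_links(concepts, index_terms, conflate=lambda word: word):
    """Link each index term to the concepts containing it, and vice versa.

    concepts: iterable of concept labels (strings).
    index_terms: set of index terms in the dictionary.
    conflate: word-normalisation function (identity by default, an assumption).
    Returns (term_to_concepts, concept_to_terms) as plain dicts of sets.
    """
    term_to_concepts = defaultdict(set)
    concept_to_terms = defaultdict(set)
    for label in concepts:
        for word in split_concept(label):
            term = conflate(word)
            if term in index_terms:
                term_to_concepts[term].add(label)
                concept_to_terms[label].add(term)
    return dict(term_to_concepts), dict(concept_to_terms)


# Illustrative data only.
concepts = ["Database Management", "Database Applications", "Programming Languages"]
index_terms = {"database", "object", "languages"}
term_to_concepts, concept_to_terms = build_tc_links(concepts, index_terms)
```

Looking up `term_to_concepts["database"]` corresponds to moving from the second level to the third, and `concept_to_terms["Programming Languages"]` to the symmetric move from a concept down to its component index terms.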
We are going to address the difficulties in relating terms to concepts in a more "intelligent" way, since, for example, there could be several concepts related to a term. However, we are aware of the complexity of this task and of similar ones: automatic construction of thesauri and passage retrieval, to name but a few.

[Figure 8: A concept related to database]

If the user is not able to find useful information in the concept database management, he has to go down to the second level and try another strategy, such as retrieving the pertinent documents that are relevant to database, as in Fig. 5, from which a relevant document is retrieved (Fig. 9).

Conclusions

We have presented a complete content-based approach for the automatic construction of an IR hypermedia, and an effective tool based on it. We developed this tool to enable the user to automatically produce a hypertext structure written in HTML from a collection of documents. This hypertext, which we call IR hypermedia because it can also be enriched by multimedia documents, can be browsed and queried using any of the World Wide Web graphical interfaces supporting HTML, for example Mosaic, running on various platforms ranging from UNIX to Macintosh. The availability of such a tool can make large collections of documents available for browsing and querying to a large number of users throughout the Internet.

At present we are addressing the problems connected with the enhancement of the querying capabilities. In particular, we are developing a querying tool that uses a form of constrained spreading activation over the IR hypermedia to produce and present to the user a ranking of the documents, because a form of spreading activation could engage the user in an iterative and interactive browsing/querying process.

[Figure 9: A document relevant to database]

Acknowledgements

This work was partially funded by a 1993 MURST grant of the Italian Ministry of University. The work of Massimo Melucci has been supported in part by a grant of IDOMENEUS, the ESPRIT Network of Excellence No. 6606 on Information and Data on Open Media for Networks of Users, for a visiting period at the Department of Computing Science of the University of Glasgow (Scotland).

References

[Agosti and Crestani, 1993] M. Agosti and F. Crestani. A methodology for the automatic construction of a Hypertext for Information Retrieval. In Proceedings of the ACM Symposium on Applied Computing, pages 745-753, Indianapolis, USA, February 1993.

[Agosti and Marchetti, 1992] M. Agosti and P.G. Marchetti. User navigation in the IRS conceptual structure through a semantic association function. The Computer Journal, 35(3), 1992.

[Agosti et al., 1989] M. Agosti, G. Gradenigo, and P. Mattiello. The Hypertext as an Effective Information Retrieval Tool for the Final User. In Antonio A. Martino, editor, Pre-proceedings of the 3rd International Conference on Logics, Informatics and Law, pages 1-19, Florence (Italy), 1989.

[Dunlop, 1991] M. Dunlop. Multimedia Information Retrieval. PhD thesis, Department of Computing Science, University of Glasgow, Glasgow, UK, October 1991.

[Frisse, 1988] M.E. Frisse. Searching for information in a medical handbook. Communications of the ACM, 31(7):880-886, 1988.

[Oddy, 1975] R.N. Oddy. Reference retrieval based on user induced dynamic clustering. PhD thesis, University of Newcastle upon Tyne, UK, Computing Science Department, 1975.

[Porter, 1980] M.F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.

[Salton and Buckley, 1988] G. Salton and C. Buckley. On the use of spreading activation methods in automatic Information Retrieval. In Yves Chiaramella, editor, Proceedings of ACM SIGIR, Grenoble, France, June 1988. Laboratoire IMAG Genie Informatique.

[Salton and McGill, 1983] G. Salton and M.J. McGill. Introduction to modern Information Retrieval. McGraw-Hill, New York, 1983.

[Srinivasdan, 1992] P. Srinivasdan. Thesaurus construction. In W.B. Frakes and R. Baeza-Yates, editors, Information Retrieval: data structures and algorithms, chapter 9. Prentice Hall, Englewood Cliffs, New Jersey, USA, 1992.

[Thompson, 1989] R.H. Thompson. The design and implementation of an intelligent interface for Information Retrieval. Technical report, Computer and Information Science Department, University of Massachusetts, 1989.

[van Rijsbergen, 1979] C.J. van Rijsbergen. Information Retrieval. Butterworths, London, second edition, 1979.