Automatic Metadata Analysis for Environmental Information Systems


Jens Hartmann (1) and Heiner Stuckenschmidt (2)

(1) Center for Computing Technologies, University of Bremen, jhart@tzi.de
(2) AI Department, Vrije Universiteit Amsterdam, heiner@cs.vu.nl

Abstract

Metadata plays an important role in web-based environmental information systems (EIS). It structures existing information and provides background information about technical issues as well as the context in which information was generated or should be interpreted. Further, many systems, such as BUISY (Bremer Umweltinformationssystem), use metadata to provide content-based search facilities. Such methods, however, depend on the correctness and completeness of the metadata. In this paper, we discuss an approach for automatically analyzing the metadata of web-based information systems that is based on machine learning techniques, and we describe its application to different environmental systems. We analyze three web-based EIS and discuss our results.

Introduction

The importance of metadata for managing and accessing environmental information has been widely recognized, as witnessed by a number of publications at previous conferences on computer science in environmental protection. Attention has been drawn to metadata standards, the creation of metadata, metadata-based access to information, and metadata analysis and maintenance. With the introduction of web-based information systems, the prominent role of metadata was recognized very early (Crossley 1994). In modern web-based information systems, metadata is no longer just an addition to the actual information; it plays an active role in the functionality of the system. An example is the BUISY system, which uses metadata annotations to provide a content-based search facility (Voegele et al. 2000). This new role of metadata as part of the system's functionality makes it even more vital to ensure that the metadata is correct and complete.

At the same time, metadata validation becomes more difficult in a web-based context, because the metadata usually appears as annotations on individual web pages. This rules out previous validation approaches that relied on centralized access to metadata in a database (Voigt et al. 1999). We argue that there is a need to support the analysis of metadata that is only indirectly contained in web pages as annotations in a markup language such as HTML, XML or RDF.

Currently, we focus on a special form of content-related metadata known as web page categorization. Here, the task is to assign the pages of a web site to a set of predefined object classes, as they are used in well-known metadata repositories such as the UDK (Swoboda et al. 2000). Unlike in the UDK, our object classes do not refer to the type of the information source but to the subject area of a page, such as air pollution or water protection.

Based on a representation of HTML and XML documents as a Document Object Model (DOM), we use a formal representation to describe the logical structure of a document. We then identify structural patterns that carry content-related information, in our case metadata structures that occur on web pages in the form of HTML meta tags. The structural descriptions of web pages are divided into a training set and a test set. Using the inductive logic programming system Progol (Muggleton 1995), we can then generate classification rules that relate structures on web pages to different topic areas in an environmental system. These classifiers consist of logical rules describing what all pages of a category have in common (Stuckenschmidt et al. 2002). For example, we discovered the following rule for the environmental information system of Bavaria, which classifies pages from the area of waste management:

    document(a) :- relation(a,b), metatag(b,keywords,abfall).

The rule requires that every page on this topic links to another page whose keyword list contains the word abfall (German for "waste"). In the same way, we can analyze other properties of a web page that may contain information about its content.
The result of this rule generation process gives us a means of assessing the quality of the metadata in the system, because well-designed metadata should clearly identify the subject of a web page, at least at an abstract level. This can be done by assigning keywords to pages, as in the example above, or by directly linking a page to a subject area. In the presence of such meta information, our learning approach should be able to detect the corresponding pattern in the page structure with an accuracy of 100 percent. Any learning result with a lower accuracy indicates missing, false or badly designed metadata. In the following, we first describe our validation approach in more detail. We then summarize the results of applying the approach to three web-based environmental information systems in Germany and Austria. We interpret the results of the experiments by speculating about the origin of misclassifications.
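To make the reading of the waste-management rule concrete, the following minimal sketch checks the rule document(A) :- relation(A,B), metatag(B,keywords,abfall) against a small set of facts. The page identifiers, the fact sets and the function name are hypothetical and serve only as illustration; in the paper such rules are generated and evaluated with Progol, not with Python.

```python
def matches_waste_rule(doc, relations, metatags):
    """Return True if `doc` satisfies
    document(A) :- relation(A,B), metatag(B,keywords,abfall).

    relations: set of (source, target) link pairs   (hypothetical encoding)
    metatags:  set of (page, name, word) triples    (hypothetical encoding)
    """
    return any(
        (target, "keywords", "abfall") in metatags
        for source, target in relations
        if source == doc
    )


if __name__ == "__main__":
    # Invented example facts: p1 links to p2, p3 links to p4.
    relations = {("p1", "p2"), ("p3", "p4")}
    metatags = {("p2", "keywords", "abfall"), ("p4", "keywords", "luft")}
    for page in ("p1", "p3"):
        print(page, matches_waste_rule(page, relations, metatags))
```

A page that carries the keyword itself but links to no annotated page does not match, which mirrors the relational character of the rule.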

The Approach

The analysis of metadata used in Environmental Information Systems (EIS) relies on structural rules that describe the metadata common to a given category, e.g. the category of waste management. The process of generating such structural rules is modeled as a knowledge discovery process, which can be separated into the following five steps (based on Chang et al. 2001):

1. Data Cleaning
2. Data Transformation
3. Data Reduction
4. Data Mining
5. Knowledge Representation

The first step of our approach cleans noisy or inconsistent data in the documents, since we use documents from different EIS that are available on the World Wide Web (WWW).

The next step transforms the possibly different document formats into one defined format for further processing. To this end, we represent HTML and XML documents as a Document Object Model (DOM) and use a formal representation to describe the logical structure of a document. A small set of predicates has been defined to express structural elements. This representation can be applied to known and even to unknown documents.

For data reduction, information is represented in a more general way than it occurs in the original data. The generalization process primarily covers text and single words. All words are mapped to lowercase alphanumeric strings, and each word is represented by one predicate. To illustrate, the document title "Welcome to MY Homepage!!" is translated into a set of four predicates {welcome, to, my, homepage}.

The syntactical transformation comprises several pre-processing steps that verify the syntactic structure of the documents after conversion to XHTML. In addition, we defined a general transformation based on a generalization of the text in these documents. The transformation process is based on the DOM representation, which is traversed by the developed software. Declared document structures are extracted and represented in this formal way, for which we adopt Prolog syntax.

To identify general regularities that can serve as a classifier (data mining), the generated sets of formal clauses are used as input to the Inductive Logic Programming (ILP) system Progol. Given any available background knowledge (BK) and a set of positive and negative examples, Progol generates a hypothesis which, together with the BK, explains the positive examples. This rule can then be used as a general classifier for the document class. The generated rules are stored in a separate file and are used as background knowledge for further classification tasks.
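The data reduction step can be sketched as follows. This is a minimal illustration, assuming that generalization means lowercasing and keeping only alphanumeric words; the function names generalize and title_facts, the document identifier d1, and the choice of a two-argument doctitle fact are assumptions made for the example rather than the authors' implementation.

```python
import re


def generalize(text):
    """Lowercase the text and keep each alphanumeric word once, in order."""
    words = re.findall(r"[^\W_]+", text.lower())
    # dict.fromkeys drops repeated words while preserving first occurrence.
    return list(dict.fromkeys(words))


def title_facts(doc_id, title):
    """Emit one Prolog-style doctitle fact per generalized title word."""
    return [f"doctitle({doc_id},{word})." for word in generalize(title)]


if __name__ == "__main__":
    for fact in title_facts("d1", "Welcome to MY Homepage!!"):
        print(fact)
```

Applied to the title from the text, generalize yields the four words welcome, to, my and homepage. Words containing non-ASCII letters or starting with a digit would need quoting before they are valid Prolog atoms, which this sketch does not handle.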

Learning Metadata Classifiers

Metadata offers an expressive framework for analyzing documents of web-based information systems. It covers different aspects of information: technical issues such as access methods or processing instructions, as well as information about the document content, such as intended uses or author information. The study by Yang et al. (2002) shows in detail the importance of well-structured metadata for classification tasks.

Syntactically, metadata can be expressed by so-called meta tags. We represent metadata by means of the metatag/3 predicate, which is defined as follows:

\[
\mathit{metatag}(I, N, C) \leftarrow \mathit{descendant}(D, I) \wedge \mathit{structure}(I, \mathit{meta}) \wedge \mathit{attribute}(I, Q) \wedge \mathit{attribute}(I, W) \wedge \mathit{structure}(Q, x) \wedge \mathit{value}(Q, N) \wedge \mathit{structure}(W, y) \wedge \mathit{value}(W, C)
\]

where \(x \in \{\text{http-equiv}, \text{name}\}\) and \(y \in \{\text{content}\}\). The values of the attributes N and C are restricted by the weak generalization, and every single word is associated with one predicate. The number of metatag predicates for a document is consequently

\[
\sum_{i=1}^{n} p \cdot v_i
\]

where n is the number of meta tags in the document, p is the number of elements from {http-equiv, name} (typically p = 1) and \(v_i\) is the number of words in the content attribute of the i-th meta tag.

To generate structural metadata classifiers we use Inductive Logic Programming (ILP), which can be characterized as the intersection of machine learning and logic programming (Muggleton 1999). In general, ILP is concerned with generating a hypothesis H that describes a set of examples E (partitioned into positive examples \(E^+\) and negative examples \(E^-\)), given background knowledge B. Specifically, the normal semantics of ILP can be formalized as follows:

1. \(B \wedge E^- \not\models \Box\)
2. \(B \not\models E^+\)
3. \(B \wedge H \wedge E^- \not\models \Box\)
4. \(B \wedge H \models E^+\)

The normal semantics of ILP demands that (1) the BK is consistent with the negative examples (prior satisfiability) and that (2) the BK does not already explain the positive examples, since otherwise no learning would be necessary (prior necessity). Furthermore, the learned hypothesis must (3) be consistent with the negative examples (posterior satisfiability) and (4) explain all positive examples (posterior sufficiency).
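The metatag representation can be illustrated with a short sketch that reads HTML meta tags and emits one metatag fact per content word, using only Python's standard html.parser module. The class and function names, the document identifier d1 and the quoting helper are assumptions for this example; the paper's own software works on a DOM representation and is not shown in the source.

```python
import re
from html.parser import HTMLParser


def atom(text):
    """Quote a value unless it is already a plain lowercase Prolog atom."""
    return text if re.fullmatch(r"[a-z][a-z0-9_]*", text) else f"'{text}'"


class MetaTagCollector(HTMLParser):
    """Collect (name, word) pairs from <meta name|http-equiv=... content=...>."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or attributes.get("http-equiv")
        content = attributes.get("content")
        if not name or not content:
            return
        # One pair per generalized (lowercase, alphanumeric) content word.
        for word in re.findall(r"[^\W_]+", content.lower()):
            self.pairs.append((name.lower(), word))


def metatag_facts(doc_id, html):
    """Return Prolog-style metatag facts for all meta tags in `html`."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    return [
        f"metatag({doc_id},{atom(name)},{atom(word)})."
        for name, word in collector.pairs
    ]


if __name__ == "__main__":
    page = '<head><meta name="keywords" content="Abfall, Recycling"></head>'
    for fact in metatag_facts("d1", page):
        print(fact)
```

In this sketch a meta tag with two content words produces two facts, in line with the count of p times v_i facts per meta tag for p = 1.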

Using ILP for data mining makes it possible to discover relational regularities in the training sets that cannot be detected with classical attribute-value learners. To illustrate, we present the following rule:

    document(a) :- relation(a,b), relation(b,c), doctitle(c,abfall).

This rule classifies all documents of the data set correctly. Without such relational descriptions, the accuracy reaches only 80.30%. In summary, learning relations and relational regularities among documents increases the accuracy of the learned classifiers (Hartmann 2002).

Experiments

To evaluate the approach we used three web-based environmental information systems. The systems have similar data structures and a comparable number of documents. All documents were downloaded automatically with the web downloader wget. We applied our approach to validate the metadata of the EIS of Bremen, Vienna and Bavaria.

We analyzed the topic areas that normally structure information within these systems (waste management, soil protection, nature conservation, air pollution and water pollution) and sorted pages into these topic areas based on their metadata. In theory, this classification should be perfectly accurate for any EIS that maintains content-related metadata. We evaluate each system by the accuracy of the learned classifier, calculated as

\[
\mathit{Acc} = \frac{P(A) + P(\neg A)}{P(A) + \bar{P}(A) + P(\neg A) + \bar{P}(\neg A)}
\]

where \(P(A)\) is the number of correctly classified positive examples and \(\bar{P}(A)\) the number of positive examples classified as negatives. Analogously, \(P(\neg A)\) is the number of correctly classified negatives and \(\bar{P}(\neg A)\) the number of negatives classified as positives.

The experimental results in Table 1 reveal a well-designed metadata infrastructure in all analyzed systems, with the exception of parts of the Bavarian system. We were able to classify the majority of pages from the meta information provided by these systems.
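The accuracy measure described above can be written as a small function. This is a minimal sketch; the parameter names are chosen for this example, and the counts in the usage lines are invented, not taken from Table 1.

```python
def accuracy(pos_correct, pos_as_neg, neg_correct, neg_as_pos):
    """Share of correctly classified examples among all examples.

    pos_correct: positive examples classified as positive
    pos_as_neg:  positive examples classified as negative
    neg_correct: negative examples classified as negative
    neg_as_pos:  negative examples classified as positive
    """
    total = pos_correct + pos_as_neg + neg_correct + neg_as_pos
    if total == 0:
        raise ValueError("no examples to evaluate")
    return (pos_correct + neg_correct) / total


if __name__ == "__main__":
    # Hypothetical counts for one topic area.
    print(f"{100 * accuracy(8, 2, 9, 1):.2f}%")
```

An accuracy of 1.0 requires that both error counts are zero, which is the situation the approach expects for complete and consistent metadata.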

Table 1: Classification Results (cells marked "-" are missing in the source)

System    Category   Pos. correct   Pos. as neg.   Neg. correct   Neg. as pos.   Acc. (%)
BUISY     Waste      -              -              -              -              -
BUISY     Soil       -              -              -              -              -
BUISY     Air        -              -              -              -              -
BUISY     Nature     -              -              -              -              [...].41
BUISY     Water      -              -              -              -              [...].25
BUISY     Overall    -              -              -              -              98.53
Ubavie    Waste      -              -              -              -              -
Ubavie    Soil       -              -              -              -              -
Ubavie    Air        -              -              -              -              -
Ubavie    Nature     -              -              -              -              -
Ubavie    Water      -              -              -              -              -
Ubavie    Overall    -              -              -              -              100
Bavaria   Waste      -              -              -              -              -
Bavaria   Soil       -              -              -              -              -
Bavaria   Air        -              -              -              -              [...].33
Bavaria   Nature     -              -              -              -              -
Bavaria   Water      -              -              -              -              -
Bavaria   Overall    -              -              -              -              58.67
Total                                                                            85.73

The learning process also helps users identify relevant classification criteria that are not obvious. For example, our approach generated the following rules (a subset of the rules produced in all experiments):

    document(a) :- metatag(a,author,'zdl30-13').
    document(a) :- metatag(a,bereich,naturschutz).
    document(a) :- relation(a,b), metatag(b,keywords,abfall).
    document(a) :- relation(a,b), metatag(b,keywords,bodenschutz).

Interpretation of Results

The results of the analysis give us some insight into the status of the metadata annotations in the different systems and even allow us to speculate about each system and the way it uses metadata. In the case of the EIS of the City of Vienna, for example, it is quite obvious that the metadata annotations are generated automatically, as we get an accuracy of 100% for all subject classes.

In the case of the BUISY system, the accuracy is very close to 100%, but we also found some mismatches. This observation can be explained by the development process of the system, which was originally designed in a research project and then handed over to the federal administration. In the research project, metadata annotations were added automatically using a special software tool. After the system was transferred to the administration, new pages were evidently added, some of which do not contain proper metadata, which leads to a sub-optimal classification result.

For the EIS of the federal state of Bavaria, the situation is more complex. As Table 1 shows, there are some subject areas for which a classifier with an accuracy of 100% could be generated; for other subject areas, however, no classification rule could be found at all. This observation can be explained by the fact that the system was being redesigned at the time of our experiments. Some of the pages were already properly annotated with metadata, while others were not. This process has since been completed and all pages are annotated. If we repeated the experiments now, we would expect a result very close to 100% accuracy for all classes in the system.

Discussion

In this paper we presented an approach for automatic metadata analysis in environmental information systems. The approach can be applied to web-based environmental systems with content-related metadata. We presented the generation of classifiers as a knowledge discovery process with several pre-processing steps, such as cleaning, reduction and transformation of the data. We argued that ILP is necessary to discover relational regularities among a set of documents. The process helps users identify relevant classification criteria that are not obvious. We presented non-obvious results of learned classifiers for three environmental systems.
These results apply broadly to the administration and management of web-based information systems. Information taken from web-based information systems must be regarded as noisy and potentially erroneous; at present, this error handling is performed during pre-processing. However, this is still an open problem that requires additional work. An extension of this approach to analyze additional structures of web documents is described in (Hartmann 2002).

Bibliography

Chang, G., Healey, M.J., McHugh, J., Wang, J. (2001): Mining the World Wide Web: An Information Search Approach. Kluwer Academic Publishers.

Crossley, D. (1994): WAIS through the Web - Discovering Environmental Information. Presented at the Second International WWW Conference (WWW Fall 94), Mosaic and the Web, Chicago, USA (17-20 October 1994).

Hartmann, J. (2002): Lernen struktureller Regeln zur Klassifikation von Web-Dokumenten. Diploma thesis, Universität Bremen, TZI.

Muggleton, S. (1995): Inverse Entailment and Progol. In New Generation Computing, Special Issue on Inductive Logic Programming, vol. 13, Ohmsha.

Muggleton, S. (1999): Inductive Logic Programming. In The MIT Encyclopedia of the Cognitive Sciences (MITECS), MIT Press.

Stuckenschmidt, H., Hartmann, J., van Harmelen, F. (2002): Learning Structural Classification Rules for Web-Page Categorization. Accepted for the Special Track on the Semantic Web at FLAIRS 2002, Pensacola, Florida.

Swoboda, W., Kruse, F., Legat, R., Nikolai, R., Behrens, S. (2000): Harmonisierter Zugang zu Umweltinformationen für Öffentlichkeit, Politik und Planung: Der Umweltdatenkatalog UDK im Einsatz. In Armin B. Cremers, Klaus Greve (eds.): Computer Science for Environmental Protection '00, Environmental Information for Planning, Politics and the Public. Metropolis, Marburg.

Voegele, T., Stuckenschmidt, H., Visser, U. (2000): BUISY - Using Brokered Data Objects in Environmental Information Systems. In Wolf-Fritz Riekert, Klaus Tochtermann (eds.): Hypermedia im Umweltschutz, 3. Workshop, Ulm 2000. Metropolis, Marburg.

Voigt, K., Welzl, G., Rediske, G. (1999): Datenanalyse von umweltrelevanten Metadatenbanken. In Claus Rautenstrauch, Michael Schenk (eds.): Umweltinformatik 99 - Umweltinformatik zwischen Theorie und Industrieanwendung, 13. Internationales Symposium "Informatik für den Umweltschutz". Metropolis Verlag, Marburg.

Yang, Y., Slattery, S., Ghani, R. (2002): A Study of Approaches to Hypertext Categorization. In Journal of Intelligent Information Systems. Kluwer Academic Press.
