Study of the heterogeneity in cultural databases and transformation of examples from CIMI to the CIDOC CRM


Iraklis Karvasonis
Institute of Computer Science, Foundation for Research and Technology Hellas
Science and Technology Park of Crete, P.O. Box 1385, GR , Heraklion, Crete, Greece
Technical Report 291, ICS-FORTH, October 2001

Abstract: The goal of our work was to transform a small sample of different cultural databases to the CIDOC CRM [4] and to study the heterogeneity in data structure between these models. This report also analyzes the data structure mapping between the models. The transformations were carried out with the Data Junction [6] conversion tool, and the final output format is XML, based on a DTD of the CIDOC CRM. The results show that the CIDOC CRM captures the domain of museum data adequately and effectively, and semantically covers all the concepts that arose from the source models we examined. We also report some proposals for structuring the data of the examined models better, which would make their transformation to the CIDOC CRM easier.

Keywords: heterogeneity, transformation, CIDOC CRM, data structure mapping, relational databases, XML

1. General goal and observations:

The goal of our work was to examine the heterogeneity between various data types and the CIDOC Conceptual Reference Model, and to transform these data from the form supplied by CIMI members to the CIDOC CRM. CIMI [5] is a consortium of cultural heritage institutions and organizations. It has called for testing information integration models with real data from its members [1]. The data we used were Microsoft Access databases that describe objects of three museums that are members of CIMI.

The "CIDOC object-oriented Conceptual Reference Model" (CRM) [4] was developed by the ICOM/CIDOC Documentation Standards Group. It represents an 'ontology' for cultural heritage information, i.e. it describes in a formal language the explicit and implicit concepts and relations relevant to the documentation of cultural heritage. The primary role of the CRM is to serve as a basis for mediation of cultural heritage information and thereby provide the semantic 'glue' needed to transform today's disparate, localized information sources into a coherent and valuable global resource.

To understand the problem of mapping to the CIDOC CRM better, we considered some principles for the design of ontologies used for knowledge sharing [8]. We explored the differences between the source data types and the CIDOC CRM definitions and looked for ways to represent the same meaning in a different form. Our goal is to convert the data into a form that conforms to the CIDOC CRM, representing them as faithfully as we can and with no loss of information.

The data samples we mapped and converted come from the National Museum of Denmark, the Museum of Natural History London (Clayton Herbarium) and Australian Museums On-Line. The semantics of all these samples were completely covered by the CIDOC CRM. There were, however, differences in the complexity and in the degree of automation that could be achieved.

The conversions were carried out with a commercial conversion tool, Data Junction 7.5 [6], which supports a large variety of conversions. The target files are XML instances of the simplest DTD that correctly represents the CIDOC CRM semantics and allows creating instances structurally equivalent to correct RDF instances of a full RDFS version of the CIDOC CRM. The target files can be read naturally by using an XSL file that makes the properties visible.

For every example we had to identify the schema-to-CRM mappings of the sample and then implement and test them. A straightforward first step wraps the whole sample in an XML instance, which takes longer for deeply nested tables; the semantic mapping from XML to XML follows. During the conversions we used different kinds of mapping so that the data would be compatible with the CIDOC CRM. These kinds of mapping and the problems we faced during the conversions are explained in more detail below.

2. Kinds of Mapping:

While studying and processing the example databases we observed many different mapping patterns, depending on the complexity and the structure of each database. To understand the databases better, we also considered some issues about the mapping of objects to relational databases [2]. Every database we worked with was first converted to a full one-to-one representation in XML. We then designed, for every database, a conversion model from this representation to XML based on the CIDOC Conceptual Reference Model and a Document Type Definition (DTD) of it. So, for every field of the database, we had either to map it to the appropriate element of the CIDOC CRM or to combine it with one or more other fields and convert them so that they would be compatible with the CIDOC CRM.
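The first step described above, the one-to-one representation of a table in XML, can be sketched as follows. This is only a minimal illustration in Python, not the Data Junction procedure used in the report: the function name `table_to_xml`, the wrapper element `record` and the sample rows are assumptions, and only the field names are taken from the NMD "Object" table.

```python
import xml.etree.ElementTree as ET


def table_to_xml(table_name, rows):
    """Wrap every row of a table in XML, one child element per field."""
    table = ET.Element(table_name)
    for row in rows:
        # "record" is a hypothetical wrapper name for one database row.
        record = ET.SubElement(table, "record")
        for field, value in row.items():
            # Empty fields are kept as empty elements so that no field is lost.
            ET.SubElement(record, field).text = "" if value is None else str(value)
    return table


# Hypothetical rows; only the field names come from the NMD "Object" table.
rows = [
    {"ObjectId": 1, "InventoryNumber": "A 100"},
    {"ObjectId": 2, "InventoryNumber": None},
]
print(ET.tostring(table_to_xml("Object", rows), encoding="unicode"))
```

Such a wrapper keeps every field, including the internal identifiers, which is why the semantic mapping to the CIDOC CRM is a separate second step.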
In all the following figures that show a kind of mapping, we have an object that a record of the database refers to, a field of this record and the value of this field. A new record is created for the same object; it is a CIDOC CRM DTD instance with element names, and every element has a value. The element names correspond to property instances of the RDF Schema that describes the CIDOC CRM. Every value of an element belongs to a class of the same RDF Schema. When creating the object as a CRM instance we use global identifiers, which are a compound of the name of the identifier used in the database and the value of this identifier. All these cases are examined in more detail below.

1) The first situation is the simplest we found during the conversions of the data. We have an object, which is described within a database record, a field of the database for this record and a value for this field. A new object is created as a CRM instance and the field name is mapped to the appropriate element of the CIDOC CRM. The value of this element is the value of the mapped field. A graphical representation of this kind of mapping is shown in the following figure.

[Figure 1. The simplest kind of mapping: record A, field name, field value B; record C, element name, element value D]

In figure 1, 'A' stands for the object that a record of the database as a whole refers to. The link from A to B stands for a field of the respective record and 'B' stands for the value of this field. 'C' stands for the same object represented as the CRM instance into which we converted 'A'. The link from C to D stands for the element name of the CRM instance and 'D' stands for the value of this element. The link from C to D in figure 1 and the analogous links in the following examples correspond to RDF property instances, and the C and D values correspond to RDF class instances.
The class to which C belongs depends on the interpretation of the table name, so it can be, for example, a Man-Made Object or a Biological Object. The class to which D belongs depends on the interpretation of the name of the field that we are converting to the CIDOC CRM. So, if e.g. the field name is Creator, then D belongs to the class Person. For example, in the data from the AMOL museum, the field ObjectID of the database was converted to the element "is identified by" of the CIDOC CRM. The value of the element "is identified by" is a class instance of the RDF Schema of the CIDOC CRM and belongs to the class E42: Object Identifier.

2) The next situation is more complicated because we create two or more elements of the CIDOC CRM from one field of the database. This happens when the data are complex enough that we have to cut them into substrings and create more than one element. Sometimes it also happens because of the structure of the CIDOC CRM, which needs more depth in an element even in some simple situations. A graphical representation of this kind of mapping is shown in the following figure.

[Figure 2. Creation of more than one element: A, field name, B; C, D, E, F]

The depth of this kind of mapping depends on the complexity and the content of the data. For example, in the AMOL database a field named "Statement" contains the data "Utopia, Northern Territory, Australia". This field was converted to the "took place at" element with data "Utopia". The next element was the "falls within" element with data "Northern Territory" and the final element was again the "falls within" element with data "Australia". So, the initial field was converted into three elements of the CIDOC CRM and the created path is: took place at -> falls within -> falls within. Each of the C, D, E, F element values in figure 2 is a class instance, and its value depends on the name of the converted field and on the value of this field.

The inverse case, creating only one element from two or more fields, was observed in situations where the relational database uses internal identifiers to connect its tables.
So, we do not need these identifiers, and after the conversion the field we want, with the correct data, is held in a single element. This was observed in the data of the National Museum of Denmark, where the one-to-one representation of the database in XML was very deeply nested. During the conversion we discarded all internal identifiers, so the final XML archive is less deeply nested.

3) In this case the final field is created as a compound of more than one field. The field is an ontology that has more than one distinct part, so the compound of these parts gives the name of the ontology, which is the value of the field we use in the conversion. Some principles for the design of ontologies were also considered in this situation [8]. A graphical representation of this kind of mapping is shown in the following figure.

[Figure 3. Compound of more than one field (nodes A to G; label: Compound of the fields B, C, D)]

For example, in the data from the Clayton museum, the ontology Variety of an object is the compound of the fields Linnaean Genus, Linnaean Species and Linnaean Variety. The produced element is the "assigned" element of the CRM instance and the value of this element is the value of the ontology Variety.

4) The following case depends on the data of some fields of the databases rather than on the schema only, as in the previous cases. That is, the mapping is based not only on the structure of the database but also on the content of its fields. So, during the conversions we use expressions and conditions whose values and results guide us to the mapping that corresponds to the CIDOC CRM. A graphical representation of this kind of mapping is shown in the following figure.

[Figure 4. The created element depends on the value of some fields of the database (nodes A to E)]

The name of the created record and the name and value of the created element in figure 4 depend on the E and B field values. For example, in the data from the National Museum of Denmark the table Event has a field named FieldID. From the value of this field we determine whether the event is an accession event, a measurement event or just a classification event. Depending on the kind of event, we convert this field to the corresponding element of the CRM instance, which can be the element "was measured", "was classified by", "was produced by" etc.

5) The next case of mapping, which we met in the conversion of all the databases, is the most complicated. We have two or more parallel fields for the same object, which are interpreted as one path of nested elements by using the information of all the fields. The object created as a CRM instance has a path of elements that results from the combination of the initial fields. A graphical representation of this kind of mapping is shown in the following figure.

[Figure 5. Combination of more than one field (nodes A to F)]

In figure 5, the combination of the two fields produces an object that has two consecutive elements. For example, in the data from the Clayton Museum we have the fields State and Country, which describe the place of an event. The field State is converted to the "took place at" element of the CRM instance and the field Country to the "falls within" element, which is nested in the previous element. So, the following path is created: took place at -> falls within.
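Cases 2 and 5 both end in the same nested path of places, once by cutting one field into substrings and once by combining parallel fields. The sketch below illustrates this in Python under stated assumptions: the element names `Event`, `took_place_at`, `falls_within` and `Place`, the `name` attribute and the State/Country values are hypothetical and do not come from the DTD used in the report.

```python
import xml.etree.ElementTree as ET


def place_path(event, places):
    """Nest places from the most specific to the most general one."""
    parent = ET.SubElement(event, "took_place_at")
    for i, name in enumerate(places):
        place = ET.SubElement(parent, "Place")
        place.set("name", name)
        if i < len(places) - 1:
            # Every broader place is nested inside the narrower one.
            parent = ET.SubElement(place, "falls_within")
    return event


# Case 2: one field is cut into substrings (value quoted in the report).
statement = "Utopia, Northern Territory, Australia"
event_a = place_path(ET.Element("Event"), [p.strip() for p in statement.split(",")])

# Case 5: two parallel fields are combined (hypothetical values).
record = {"State": "Virginia", "Country": "USA"}
event_b = place_path(ET.Element("Event"), [record["State"], record["Country"]])

print(ET.tostring(event_a, encoding="unicode"))
print(ET.tostring(event_b, encoding="unicode"))
```

Using one helper for both cases shows that they differ only in where the list of places comes from, not in the structure of the target path.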

6) In the last case we have two or more fields for a specific object within a record of the database, and in the object created as a CRM instance we create an intermediate node that contains, nested in it, the elements to which these fields are converted. A graphical representation of this kind of mapping is shown in the following figure.

[Figure 6. Creation of an intermediate node (nodes A to G)]

For example, in the data from the Clayton Museum one field describes the collector of a collection event and another describes the date of the collection event. After the conversion, we have an object as a CRM instance and the "changed ownership by" element, which describes an event that happened to this object. This element is the intermediate node in which the other elements are nested. The collector of the event was converted to the "transferred title to" element and the date to the "at most within" element. So, the following paths are created, both going through the same intermediate node: changed ownership by -> transferred title to, and changed ownership by -> at most within.

3. Interpretation problems:

During the conversion we had some interpretation problems, because the databases we used came without a clear explanation of how some of their fields and tables should be interpreted. The first problem was the ambiguous labels of some fields, which made it hard to decide which category of the CIDOC CRM they fit best. This happens because the CIDOC CRM is a very detailed model that covers a very wide range of cases, in contrast to the simple fields of a database. For example, the fields 'Person', 'Date' etc. do not state their real meaning exactly, because they can be interpreted in more than one way. This problem complicated our conversions, and in most cases we solved it by examining the data examples of these fields.
Another problem was fields that contain a lot of information that is not separated in a standard way. For example, a field that describes the person, the place and the date of an event should contain this information in the same way in all the records of the database. Some of the databases we worked with, especially the AMOL database, encode this information differently from one record to another, so that mechanical parsing of these fields becomes impossible. For example, the two values below are from the field Made of the AMOL database. The first is "Pule, Lena; Utopia Batik Program; Ingkwalalanima camp; Utopia, Northern Territory" and the second is "G & J Weir Ltd; Glasgow, Scotland". The first contains a person, the name of a project and two places. The second contains a company and a place. So, we could not identify a simple rule that would extract this information to the CRM automatically and fit all the records of the database. The solution to this problem lies in the practice of substructuring data in the database: the same information should be separated in the same way in all the records of the database, or different fields should be created for every piece of information.
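The difficulty can be made concrete with a short Python sketch that splits the two quoted values of the field Made on the semicolon. The helper name `split_made` is hypothetical, and splitting on ";" is only an assumption about the most obvious rule one might try.

```python
# The two values of the AMOL field "Made" quoted in the text.
records = [
    "Pule, Lena; Utopia Batik Program; Ingkwalalanima camp; Utopia, Northern Territory",
    "G & J Weir Ltd; Glasgow, Scotland",
]


def split_made(value):
    """Split a Made value on ';' and trim the parts."""
    return [part.strip() for part in value.split(";")]


for value in records:
    parts = split_made(value)
    # The number of parts and the meaning of each position differ per record.
    print(len(parts), parts)
```

The first value yields four parts and the second two, and the second position holds a project name in one record and a place in the other, so a purely positional rule cannot decide which CRM element each part belongs to.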

A problem that we also met is the redundancy of the same information in a database. Some tables contain the same information in different forms. For example, in the NMD data the event for the measurement of an object is described once in general and then repeated with the details of the measurement. Also, some fields of a table contain pieces of information while another field of the same table combines all of them. This is meaningless for the CIDOC CRM and for our conversion, because all the partial information we need can be retrieved from the combined field. For example, in the NMD database the fields PrefixCharacter, PrefixNumber, NumberPart, SuffixCharacter, SuffixNumber of the table Object are parts of the field Inventory Number of the same table. So, we need only the Inventory Number field and the others are not applicable for the conversion. Below we present some of the examples we examined, with explanations.

4. First Example: Data of the National Museum of Denmark (NMD)

The source of this conversion is a Microsoft Access database; the relations between its tables are shown in the picture below. The target file is an XML file based on the CIDOC model. For this conversion we also used a DTD for the CIDOC CRM. This database has many tables, and some information is repeated in more than one table. After the one-to-one representation of the database in XML the nesting was very deep, because the database uses many identifiers to connect its tables. These identifiers are not applicable to the CIDOC CRM. So, we had to discard them and design a conversion model with the appropriate mapping to represent all the information correctly in the CIDOC CRM.
Achieving a correct mapping was difficult, because the data were complicated and some fields of the database were not explained well enough for us to understand easily which element of the CIDOC CRM they correspond to. The mapping also sometimes depended on the value of the data and was not the same for the same fields. This means that some fields of the database, for example those that describe events, have more than one corresponding element in the CIDOC CRM, depending on their value. So, the mapping becomes more complicated, but in general the data were converted to the CIDOC CRM with their entire initial meaning and content. The NMD data are analytical in the necessary detail to allow for complete automatic transformation. Two default assumptions that are not obvious from the data could be clarified with the creator and expanded to the CRM. The events of the classification and the use of an object are different in the CRM, but in the NMD data they are presented together in one event. The same happens with the events of the measurement and the acquisition. Further, as the NMD database uses dynamic types of events, a full mapping of the NMD event types to the CRM classes could have improved the mapping. The relations between the tables of the database, the mapping to the CIDOC CRM and an example of the final XML archive with the XSL we created are presented below.
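The value-dependent mapping of event fields mentioned above can be sketched as a simple lookup in Python. The property names are the ones the report lists for events (P12, P39, P41); the event code values "measurement" and "classification" are hypothetical, since the report does not list the actual NMD codes, and using P12 as the fallback is an assumption.

```python
# Hypothetical event codes; the report does not list the actual NMD values.
EVENT_PROPERTY = {
    "measurement": "P39 was measured",
    "classification": "P41 was classified by",
}
# Assumed fallback for every other kind of event.
DEFAULT_PROPERTY = "P12 was present at"


def property_for_event(event_code):
    """Choose the CRM property from the value of the event code field."""
    return EVENT_PROPERTY.get(event_code.strip().lower(), DEFAULT_PROPERTY)


for code in ("Measurement", "classification", "excavation"):
    print(code, "->", property_for_event(code))
```

A table of this kind is also where a full mapping of the dynamic NMD event types to CRM classes would be recorded.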

4.1 The relations between the tables of the NMD database

[Figure 7. The relations between the tables of the NMD database]

4.2 NMD to CRM mapping

The NMD to CRM mapping, which is presented below, is based on the Mapping of the Dublin Core Metadata Element Set to the CIDOC CRM [7].

Table "Object": NMD [E22: Man-Made Object]
NMD.ObjectId = E41: Appellation
NMD->NMD.ObjectId = P1 is identified by: Appellation
NMD.PrefixChacter, NMD.PrefixNumber, NMD.Year, NMD.NumberPart, NMD.SuffixCharacter, NMD.SuffixNumber, NMD.SuffixExensionCharacter, NMD.SuffixExensionNumber, NMD.PersonWhoCreatedRecord, NMD.DateWhenRecordWasCreated = not applicable
NMD.InventoryNumber = E42: Object Identifier
NMD->NMD.InventoryNumber = P47 is identified by: Object Identifier

Table "Hierarchy":
NMD.HierarchyId, NMD.ObjectId = not applicable
NMD.EventId = P12 was present at or P39 was measured or P41 was classified, see NMD.Event->EventCode
NMD.ParentObjectId = P46 forms part of

Table "Event":
NMD.EventID = see Hierarchy->HierarchyId
NMD.EventCode = E55: Event Type
NMD->NMD.EventCode = P2 has type: Event Type
NMD.IndexRegion1Id = E41: Appellation
NMD->NMD.IndexRegion1Id = P7 took place at P87 is identified by: Appellation
NMD.IndexRegion2Id = E41: Appellation
NMD->NMD.IndexRegion2Id = P7 took place at P87 is identified by: Appellation
NMD.IndexPlaceName1 = E53: Place
NMD->NMD.IndexPlaceName1 = P7 took place at: Place
NMD.IndexPlaceName2 = E53: Place
NMD->NMD.IndexPlaceName2 = P7 took place at: Place
NMD.ActorId = not applicable
NMD.CulturalPeriod = E52: Time Span
NMD->NMD.CulturalPeriod = P86 falls within: Time Span
NMD.StartTimePresentation = E52: Time Span
NMD->NMD.StartTimePresentation = P82 at most within P79 begins at: Time Span
NMD.EndTimePresentation = E52: Time Span
NMD->NMD.EndTimePresentation = P82 at most within P80 ends at: Time Span
NMD.StartTime = E52: Time Span
NMD->NMD.StartTime = P82 at most within P79 begins at: Time Span
NMD.EndTime = E52: Time Span
NMD->NMD.EndTime = P82 at most within P80 ends at: Time Span
NMD.EventNote = P3 has note
NMD.RecordCreationPerson, NMD.RecordCreationDate = not applicable

Table "Event Actor":
NMD.EventActorId, NMD.EventId = not applicable
NMD.IndexActorRoleId = E55: carried out by Type, see table IndexActorRole->RoleId
NMD->NMD.IndexActorRoleId = P14 in the role of: carried out by Type
NMD.ActorId = E39: Actor, see table IndexActor->ActorId
NMD->NMD.ActorId = P14 carried out by: Actor

Table "Index Event Type":
NMD.EventCode = E41: Appellation
NMD->NMD.EventCode = P1 is identified by: Appellation
NMD.EventCode1Name = E55: Event Type
NMD->NMD.EventCode1Name = P2 has type: Event Type
NMD.EventCode2Name = E55: Event Type
NMD->NMD.EventCode2Name = P2 has type: Event Type

Index "Region 1":
NMD.IndexRegion1ID = E41: Appellation
NMD->NMD.IndexRegion1ID = P7 took place at P87 is identified by: Appellation
NMD.IndexRegion1Name = E53: Place
NMD->NMD.IndexRegion1Name = P7 took place at: Place

Index "Region 2":
NMD.IndexRegion2ID = E41: Appellation
NMD->NMD.IndexRegion2ID = P7 took place at P87 is identified by: Appellation
NMD.IndexRegion2Name = E53: Place
NMD->NMD.IndexRegion2Name = P7 took place at: Place

Table "Index Actor":
NMD.ActorId = E41: Appellation
NMD->NMD.ActorId = P1 is identified by: Appellation
NMD.Initials, NMD.Acronym, NMD.RecordCreationPerson, NMD.RecordCreationDate = not applicable
NMD.Title, NMD.SurNames, NMD.FirstNames, NMD.Note = P3 has note
NMD.StreetAndNumber, NMD.Town, NMD.State, NMD.Country, NMD.Telephone, NMD.PostalCode = E45: Address
NMD->NMD.StreetAndNumber, NMD->NMD.Town, NMD->NMD.State, NMD->NMD.Country, NMD->NMD.Telephone, NMD->NMD.PostalCode = P76 has contact point: Address

Table "Index Actor Role":
NMD.IndexActorRoleId = not applicable
NMD.IndexActorRoleName = E55: carried out by Type
NMD->NMD.IndexActorRoleName = P14 in the role of: carried out by Type

Table "Dimension":
NMD.ObjectDimensionId, NMD.Condition, NMD.RecordCreationPerson, NMD.RecordCreationDate = not applicable
NMD.ObjectId = E54: Dimension or E55: Man-Made Object Type
NMD->NMD.ObjectId = P40 observed dimension: Dimension or P42 assigned: Man-Made Object Type
NMD.EventId = E16: Measurement or E17: Assignment or E8: Acquisition
NMD->NMD.EventId = P39 was measured: Measurement or P41 was classified by: Assignment

Table "Object Form Material":
NMD.ObjectFormMaterialID, NMD.ObjectDimensionID, NMD.Reliability, NMD.OrderOfMaterialEntry = not applicable
NMD.MaterialOriginalentry = E57: Material
NMD->NMD.MaterialOriginalentry = P45 consists of: Material
NMD.MaterialCorrection = E57: Material
NMD->NMD.MaterialCorrection = P45 consists of: Material
NMD.ProductionMethode = P3 has note "Method: "
NMD.Color = P3 has note "Color: "

Table "Object Form Measurement":
NMD.ObjectFormMeassurmentID, NMD.ObjectDimensionId, NMD.SpecialMeassurement = not applicable
NMD.IndexMeassurementTypeId = P2 has type, see table IndexmeasurementType->TypeName
NMD.IndexUnitOfMeassurementId = P91 unit, see table IndexMeasurement->UnitTypeName
NMD.Meassurement = P90 value

Table "Index Measurement":
NMD.IndexMeassurementUnitTypeID = not applicable
NMD.IndexMeassurementUnitTypeName = P40 observed dimension P91 unit

Table "Index Measurement Type":
NMD.IndexMeassurementTypeID = not applicable
NMD.IndexMeassurementTypeName = E55: Dimension Type
NMD->NMD.IndexMeassurementTypeName = P40 observed dimension P2 has type: Dimension Type

Table "Object Role Classification":
NMD.ObjectRoleClassificationId, NMD.ObjectDimensionId = not applicable
NMD.IndexClassification1Id = see table IndexClassification1->Name
NMD.IndexClassification2Id = see table IndexClassification1->Name

Table "Index Classification 1":
NMD.IndexClassification1Id = not applicable
NMD.IndexClassification1Name = E55: Man-Made Object Type
NMD->NMD.IndexClassification1Name = P2 has type: Man-Made Object Type

Table "Index Classification 2":
NMD.IndexClassification2Id = not applicable
NMD.IndexClassification2Name = E55: Man-Made Object Type
NMD->NMD.IndexClassification2Name = P2 has type: Man-Made Object Type

Table "Object Photo":
NMD.ObjectPhotoId, NMD.ObjectId = not applicable
NMD.EventId = P108 was produced by
NMD.PhotoNumber = E3: Document, E38: Image
NMD->NMD.PhotoNumber = P70 is documented in: Document, Image

Table "Index Capture Type":
NMD.IndexCaptureTypeId = not applicable
NMD.IndexCaptureTypeName = P70 is documented in P3 has note

4.3 An example of the final NMD archive with the XSL

XML is a language that can represent a relational database [3]. So, every database we worked with was first transformed to a one-to-one representation in XML [10]. Some architectural issues for integrating XML and relational database systems were also considered [9]. All the final XML archives that came out of the conversions are like the following RDF fragment:

<!-- Description of Epitaphios GE -->
<crm:e23.iconographic_object rdf:about="epitaphios_ge34604">
  <crm:p19.1f.is_identified_by>
    <crm:e42.object_identifier rdf:about="TA_959a"/>
  </crm:p19.1f.is_identified_by>
  <crm:p19.1f.is_identified_by>
    <crm:e42.object_identifier rdf:about="GE_34604"/>
  </crm:p19.1f.is_identified_by>
  <crm:p19.3f.preferred_identifier_is rdf:resource="ge_34604"/>
  <crm:p1.1f.has_type>
    <crm:e55.type rdf:about="ecclesiastical_embroidery"/>
  </crm:p1.1f.has_type>
  <crm:p1.1f.has_type>
    <crm:e55.type rdf:about="liturgical_cloth"/>
  </crm:p1.1f.has_type>

The element <crm:p19.1f.is_identified_by> is a property instance, the attribute rdf:about="TA_959a" is the value of this element and the nested element crm:e42.object_identifier is the class to which this value belongs. So, in the NMD example, which is presented below, and in all the other examples we have the following correspondence. The value "mandsfigur" of the example below is the value of an element and corresponds to the about attribute of the above RDF fragment. The value "is identified by" is a property instance and so corresponds to an element of the above RDF fragment. Finally, the value (E22: Man-Made Object) is a class instance and corresponds to the crm class element of the above RDF fragment.
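The correspondence between property, class and value can be checked mechanically. The Python sketch below reads a shortened, closed copy of the fragment and lists one (property, class, value) triple per nested element. The namespace URI chosen for the crm prefix and the helper names `local` and `triples` are assumptions; the report does not show the namespace declarations.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CRM_NS = "http://example.org/crm#"  # hypothetical URI for the crm prefix

# Shortened, well-formed copy of the fragment quoted in the text.
FRAGMENT = f"""
<crm:e23.iconographic_object xmlns:crm="{CRM_NS}" xmlns:rdf="{RDF_NS}"
    rdf:about="epitaphios_ge34604">
  <crm:p19.1f.is_identified_by>
    <crm:e42.object_identifier rdf:about="TA_959a"/>
  </crm:p19.1f.is_identified_by>
  <crm:p1.1f.has_type>
    <crm:e55.type rdf:about="liturgical_cloth"/>
  </crm:p1.1f.has_type>
</crm:e23.iconographic_object>
"""


def local(tag):
    """Strip the '{namespace}' part that ElementTree puts before a tag name."""
    return tag.split("}", 1)[1]


def triples(xml_text):
    """Return (property, class, value) for every nested class element."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for prop in root:
        for node in prop:
            result.append((local(prop.tag), local(node.tag), node.get(f"{{{RDF_NS}}}about")))
    return result


for triple in triples(FRAGMENT):
    print(triple)
```

Each printed triple pairs an element name (the property instance) with the class element nested in it and the value of its about attribute.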


5. Second Example: Data of the Clayton Museum

The source of this conversion is a Microsoft Access database. The target file is an XML file based on the CIDOC model. For this conversion we also used a DTD for the CIDOC CRM. The Clayton Herbarium sample is as analytical as the NMD sample, even though it is encoded in one table. This single table contains all the information about the objects of the museum, so this data is easier to convert to the CIDOC CRM than that of the other databases. By following some of the kinds of mapping described above, we represented all the information in the CIDOC CRM. Even though the Clayton database is not in any normal form (e.g. it repeats the same fields for a second event), it can be mapped without any difficulty. Some pieces of information are also repeated more than once, and on some occasions the data of two or more fields are also contained in a more general field, which is unnecessary, because we can take this information from the other fields of the database. Some fields also hold information that is of no use to the CIDOC CRM or contain no information at all, so we did not have to convert them. The fields of the database, the mapping to the CIDOC CRM and an example of the final XML archive with the XSL we created are presented below.
5.1 The fields of the database:

RowID: Unique number
Update1999: Indicates record updated but now (in 2001) redundant
LinnaeanGenus: Generic name (where name described by Linnaeus)
LinnaeanSpecies: Species name (where name described by Linnaeus)
LinnaeanVariety: Varietal name (if any) (where name described by Linnaeus)
ClaytonNo: Clayton collection number
OldBarcode: Old Barcode
Barcode: New Barcode = image filename
Image: Confirms presence of image (Yes/No)
LTPuniqueNo: Where name described by Linnaeus this refers to a unique number for that particular name in a database belonging to the Linnaean Typification Project.
SpecimenoAtBM: Specimen actually present at BM or found at BM (Yes/No)
DuplicateAtLINN: Duplicate specimens at the Linnaean Society (Yes/No)
Country: Country of origin of specimen
State: State in country of origin of specimen
Collector: Collector of specimen
CollectionDate: Date of collection of specimen
FVPhraseName: Corresponding phrase name for specimen in Flora Virginica
FloraVirginicaEdition: Edition 1 or 2 of Flora Virginica
FloraVirginicaPage: Page number
Determination1Name: Any determination (genus and species)
Determination1Genus: Any determination - just genus
Determination1Species: Any determination - just species
Determination1Author: Authority for determination name
Determination1InfraRank: Rank i.e. variety or subspecies if determination made at this level
Determination1InfraName: Name of subspecies or variety if determination made at this level
Determination1InfraAuthor: Authority for subspecies or varietal name if determination made at this level
Determination1ByDate: Name of person who has made determination and date
Determination2Name, etc.: (as above)
LinnaeanAuthority: Authority for any Linnaean name (by definition, Linnaeus)
LinnaeanReference: Bibliographic reference for Linnaean name (i.e. place of description)
LinnaeanVolume: Relevant volume for any Linnaean bibliographic reference
LinnaeanPage: Relevant page for any Linnaean bibliographic reference
LinnaeanYear: Year of publication of any Linnaean name
LinnaeanTypeStatus: Type status of specimen in relation to any Linnaean name
CurrentDivision: Division of the current name of any Linnaean name
CurrentFamily: Family of current name of any Linnaean name
Genus: genus of current name of any Linnaean name
Species: species of current name of any Linnaean name
CurrentSpeciesAuthor: authority of current species name
CurrentSubspecies: subspecies (if any) of current name of any Linnaean name
CurrentSubspeciesAuthor: authority of any current subspecies name
CurrentVariety: variety (if any) of current name of any Linnaean name
CurrentVarietyAuthor: authority of any current varietal name
Comments: Any comments or notes with regard to particular specimen

5.2 Clayton to CRM Mapping:

The Clayton to CRM mapping, which is presented below, is based on the Mapping of the Dublin Core Metadata Element Set to the CIDOC CRM [7].
Clayton [E20: Biological Object] Clayton.OldBarcode = E42: Object Identifier Clayton->Clayton.OldBarcode = P47 is identified by: Object Identifier Clayton.Barcode = E42: Object Identifier Clayton->Clayton.Barcode = P48 preferred identifier is: Object Identifier Clayton.Image = E31: Document Clayton->Clayton.Image = P70 is documented in: Document Clayton [E8: Acquisition] Clayton.Collector = E39: Actor Clayton->Clayton.Collector = P24 changed ownership by P22 transferred title to: Actor Clayton.State = E53: Place Clayton->Clayton.State = P24 changed ownership by P22 transferred title to P7 took place at: Place Clayton.Country = E53: Place Clayton->Clayton.Country = P24 changed ownership by P22 transferred title to P7 took place at P89 falls within: Place Clayton.CollectionDate = E52: Time Span Clayton->Clayton.CollectionDate = P24 changed ownership by P82 at most within: Time Span Clayton.ClaytonNo = E55: Plant Species Type Clayton->Clayton.ClaytonNo = P2 has type: Plant Species Type Clayton.ClaytonNo = E41: Appellation Clayton->Clayton.ClaytonNo = P2 has type P1 is identified by: Appellation Clayton.LinnaeanGenus + Clayton.LinnaeanSpecies = E41: Appellation Clayton->Clayton.LinnaeanGenus + Clayton-> Clayton.LinnaeanSpecies = P2 has type P1 is identified by: Appellation 17

Clayton.LTPuniqueNo = E41: Appellation
Clayton->Clayton.LTPuniqueNo = P2 has type P1 is identified by P1 is identified by: Appellation

Clayton.LinnaeanReference = E32: Authority Document
Clayton->Clayton.LinnaeanReference = P2 has type P1 is identified by P67 is referred to by: Authority Document

Clayton->Clayton.LinnaeanVolume + Clayton->Clayton.LinnaeanPage + Clayton->Clayton.LinnaeanYear = P2 has type P1 is identified by P67 is referred to by P3 has note

Clayton.LinnaeanTypeStatus = E55: Type
Clayton->Clayton.LinnaeanTypeStatus = P2 has type: Type

Clayton.Determination1 = E17: Type Assignment
Clayton->Clayton.Determination1 = P41 was classified by: Type Assignment

Clayton.Determination1By = E39: Actor
Clayton->Clayton.Determination1By = P41 was classified by P14 carried out by: Actor

Clayton.Determination1Date = E52: Time Span
Clayton->Clayton.Determination1Date = P41 was classified by P82 at most within: Time Span

Clayton.Determination1Genus = E55: Genus Type
Clayton->Clayton.Determination1Genus = P41 was classified by P42 assigned: Genus Type

Clayton.Determination1Species = E55: Species Type
Clayton->Clayton.Determination1Species = P41 was classified by P42 assigned: Species Type

Clayton.Determination2 = E17: Type Assignment
Clayton->Clayton.Determination2 = P41 was classified by: Type Assignment

Clayton.Determination2By = E39: Actor
Clayton->Clayton.Determination2By = P41 was classified by P14 carried out by: Actor

Clayton.Determination2Date = E52: Time Span
Clayton->Clayton.Determination2Date = P41 was classified by P82 at most within: Time Span

Clayton.Determination2Genus = E55: Genus Type
Clayton->Clayton.Determination2Genus = P41 was classified by P42 assigned: Genus Type

Clayton.Determination2Species = E55: Species Type
Clayton->Clayton.Determination2Species = P41 was classified by P42 assigned: Species Type

Clayton.FloraVirginica = E32: Authority Document
Clayton->Clayton.FloraVirginica = P67 is referred to by: Authority Document

Clayton->Clayton.FloraVirginicaEdition + Clayton->Clayton.FloraVirginicaPage + Clayton->Clayton.FloraVirginicaName = P67 is referred to by P3 has note

Clayton->Clayton.Comments = P3 has note
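The mapping listed above can be read as a lookup table from source fields to chains of CRM properties. The following is a minimal Python sketch of that reading, not the Data Junction implementation used in this work. It encodes only a few rows of the mapping; the names CLAYTON_MAPPING and map_record and the sample values are assumptions made for illustration.

```python
# Hypothetical encoding of a few rows of the Clayton-to-CRM mapping.
# Each source field maps to a chain of CRM properties and a target class.
CLAYTON_MAPPING = {
    "Barcode": (["P48 preferred identifier is"], "E42 Object Identifier"),
    "Collector": (["P24 changed ownership by", "P22 transferred title to"], "E39 Actor"),
    "CollectionDate": (["P24 changed ownership by", "P82 at most within"], "E52 Time Span"),
    "Determination1By": (["P41 was classified by", "P14 carried out by"], "E39 Actor"),
    "Comments": (["P3 has note"], None),  # a note has no target class in the mapping
}


def map_record(record):
    """Return (property path, target class, value) for every mapped, non-empty field."""
    statements = []
    for field, value in record.items():
        if field not in CLAYTON_MAPPING or value in (None, ""):
            continue  # unmapped or empty fields are skipped in this sketch
        path, target = CLAYTON_MAPPING[field]
        statements.append((" / ".join(path), target, value))
    return statements


# Invented sample values, for illustration only.
sample = {"Barcode": "X0001", "Collector": "Example Collector",
          "CollectionDate": "", "Unmapped": "x"}
for path, target, value in map_record(sample):
    print(f"{path} -> {target}: {value}")
```

Keeping the mapping as data rather than as code means that one table per database is enough, which matches the observation that the mapping effort is made once per database and not per record.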

5.3 An example of the final CLAYTON archive with the XSL:


6. Third Example: Data of the AMOL Museum

The source file of this conversion is a Microsoft Access database. The target file is an XML file based on the CIDOC CRM; for this conversion we also used a DTD for the CIDOC CRM. The database has two tables: the first contains all the information about the objects it describes, and the second contains the photos that correspond to the objects.

The problem with this database is that its fields contain a lot of information that should have been separated into more fields. As it stands, the same information is repeated many times and cannot easily be parsed for the conversion. The records are not written in a standard way, so the information lacks a structure that can be processed easily. The database has fields with weak semantics, such as Description, Statement and MadeNote. These seem to function mainly as formatting means, in the tradition of museum catalogs, but cannot be used to interpret semantics. A disciplined use of separators would have helped the conversions. As the data are now, automatic interpretation requires background knowledge (authorities for place names, person names, materials, organization names and object types), heuristics and eventually natural language interpretation.

Therefore we show here the result of a manual transformation, which demonstrates that the CRM captures the meaning of these data completely. This analysis may be useful for proposing a tagging scheme for the AMOL database that would facilitate automatic processing. The fields of the database and an example of the final XML archive with the XSL we created are presented below.
6.1 The fields of the database:

ObjectID             Unique number for the database use only
Name                 Name of the object
Statement            Contains complex information about the creator, the place and time of the creation
Designed             Name of the designer and place of this act
Made                 Contains complex information about creator and place of creation
Date                 Date of the creation
DateType             Type of the creation
Description          A description of the object
Marks                Inscriptions on the object
Dimensions           The dimensions of the object
DesignedNote         Note for the design of the object
MadeNote             Note for the creation of the object
UsedNote             Note for the use of the object
Used                 Contains complex information about the person, the place and the date of the use of the object
OwnedExchange        Contains complex information about the person or the company and the place of this action
OwnedExchangeNote    Note for the previous action
Subject              The subject to which the object is related
Category             The category to which the object belongs
CollDevField         More general categories to which the object belongs
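To illustrate the remark in Section 6 that a disciplined use of separators would ease automatic conversion, here is a minimal Python sketch. It assumes a purely hypothetical convention of "tag: value" pairs separated by semicolons. The actual AMOL records do not follow this convention, and the tags and sample text below are invented.

```python
def parse_tagged_field(text):
    """Split 'tag: value; tag: value' text into a dict of tag -> list of values."""
    parsed = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        tag, sep, value = part.partition(":")
        if not sep:
            # No tag present: keep the text under a catch-all key instead of guessing.
            parsed.setdefault("untagged", []).append(part)
            continue
        parsed.setdefault(tag.strip().lower(), []).append(value.strip())
    return parsed


# Invented example of what a tagged Statement field might look like.
example = "maker: Example Maker; place: Example Town; date: 1900; hand-painted"
print(parse_tagged_field(example))
```

Text without a tag is kept rather than dropped, so free-text remarks in the tradition of museum catalogs could remain alongside the structured parts.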

6.2 An example of the final AMOL archive with the XSL:


7. Conclusions:

Having examined all the databases and converted them to the CIDOC CRM, we can say that all the data were converted successfully and without loss of information, although we faced many problems during the conversions. The CIDOC CRM offers a very wide variety of constructs for representing and explaining objects, events and everything else that refers to a museum object. Some of the databases, however, had structural problems, and some of their fields were ambiguous: we did not know exactly which element of the CIDOC CRM they fit, because the model has many detailed elements for the representation of an object. In other words, some of the source databases were underspecified, since the CIDOC CRM offers more than one element for the same database value and covers a very wide variety of circumstances.

The CIDOC CRM captures the domain of museum data adequately and effectively, and semantically covers all the concepts that arose from the source databases we examined. The difficulty was therefore never whether the CIDOC CRM covers a concept, but only which element corresponds to it most properly. The complexity of the mapping is thus typically due to the intrinsic complexity of interpreting cultural data sources and is by no means introduced by the CRM.

To decode and convert the databases, we first converted each database into a one-to-one XML representation with the Data Junction conversion tool. Then we designed a conversion model that shows how every field of the databases is converted to the CIDOC CRM, so that every field of every record has a corresponding element in the CIDOC CRM. Finally, we used the Data Junction conversion tool to implement this conversion model.
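The first step of this procedure, and the requirement that every field has a counterpart in the conversion model, can be sketched as follows. This is a minimal Python sketch using the standard library only; it does not reproduce Data Junction, and the function names, the sample record and the two-entry model are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def record_to_xml(table, record):
    """Step 1: one-to-one XML view of a record, one child element per field.

    Field names are assumed to be valid XML element names.
    """
    row = ET.Element(table)
    for field, value in record.items():
        ET.SubElement(row, field).text = "" if value is None else str(value)
    return row


def unmapped_fields(row, conversion_model):
    """Check for step 2: list fields that have no target in the conversion model."""
    return [child.tag for child in row if child.tag not in conversion_model]


# Hypothetical conversion model: source field -> CRM target (labels only).
model = {"Barcode": "E42 Object Identifier", "Collector": "E39 Actor"}
row = record_to_xml("Clayton", {"Barcode": "X0001",
                                "Collector": "Example Collector",
                                "Image": None})
print(ET.tostring(row, encoding="unicode"))
print(unmapped_fields(row, model))
```

A non-empty result from unmapped_fields signals that the conversion model is still incomplete for that table.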
During the conversions we used several kinds of mapping, which helped us to complete them. Some of the problems we had were due to the heterogeneity between the databases and the CIDOC CRM. The main heterogeneity problems were the ambiguous naming of some database fields, the repetition of the same information within a database, and the non-standard structure of records within the same database. The naming of the fields may be sufficient for the database itself, but it is not sufficient for the CIDOC CRM, which is a more detailed model. Repeated information was examined carefully, so that only the part we needed was converted to the corresponding elements of the CIDOC CRM. The databases should also have a standard structure across all records, because deviations cause parsing and mapping problems.

Some databases also contain a lot of information that is difficult to parse and to map to the CIDOC CRM. For example, in the AMOL database some general fields contain a lot of information that is not uniformly distributed and cannot easily be parsed. The solution for such databases is either to distribute the information uniformly across all records or to divide the generic fields into several more specific fields. Natural language analysis and the comparison of field values with a thesaurus are also a good solution where fields are not properly labeled or the meaning of the information is unclear. The information can then be mapped much more easily to the correct elements of the CIDOC CRM. The AMOL data thus show that the CIDOC CRM could be used to design and introduce a moderate structuring that facilitates semantic interpretation and is easily comprehensible to end-user documentalists.
The Clayton data also show that this structuring need in no way be as complex and deep as the CRM, and that the end user need not fully understand the CRM. All data samples show that the CRM instances are comprehensible, even though the presented form was designed not for presentation but for an understanding of the machine-interpretable raw data themselves.

After the end of the conversion, all the National Museum of Denmark and Natural History Museum (London) samples can be transformed without manual intervention. This means that every database with the same structure as the NMD and Clayton databases can be converted to the CRM automatically. We believe that CRM instances are now ready for automatic integration, provided that given persons etc. can be sufficiently identified globally. This is again a problem of the integration process and not of the CRM. Thus, we foresee that automatic integration could be achieved through the global identifiers used in the CRM instances.

Finally, this test shows that a non-domain expert with usual knowledge in handling IT tools can execute the transformation with brief advice from a domain expert who is also knowledgeable about the CRM. This advice is needed once per database, and not per data item, if the data are sufficiently structured. This intellectual investment cannot be avoided in any intelligent data integration that tries to preserve and respect the intellectual qualities of our cultural heritage information.

8. References and Bibliography

[1] ABC/Harmony CIMI Collaboration Project,
[2] Scott Ambler: Mapping Objects to Relational Databases, October 2000
[3] Ronald Bourret: XML and Databases,
[4] CIDOC Conceptual Reference Model (see
[5] CIMI organization (see
[6] Data Junction conversion tool (see
[7] Martin Doerr: Mapping of the Dublin Core Metadata Element Set to the CIDOC CRM, July 2000
[8] Thomas Gruber: Toward Principles for the Design of Ontologies Used for Knowledge Sharing, August 1993
[9] Gerti Kappel, Elisabeth Kapsammer, Werner Retschitzegger: Architectural Issues for Integrating XML and Relational Database Systems: The X-Ray Approach
[10] W3C: World Wide Web Consortium, XML representation of a relational Database (see
