Models and techniques for representation and management of multimedia data and information


Scuola Politecnica e delle Scienze di Base
Bachelor's Degree Course in Computer Engineering
Bachelor's Thesis in Computer Engineering

Models and techniques for representation and management of multimedia data and information

Academic Year 2017/2018
Supervisor: Prof. Antonio Maria Rinaldi
Candidate: Paolo Andrea Rio, matr. N

Table of Contents

Introduction
Overview
Multimedia Data
Ontologies
Motivations
Characteristics
Requirements
Types of ontologies
Upper ontology
Domain ontology
Multimedia upper ontology
The Annotation Problem
Tools and languages
XML
RDF
OWL
MPEG-7
Protégé
M-OntoMat-Annotizer
Multimedia ontologies
Foundational ontologies
DOLCE
Basic categories
Qualities and quality regions
Functions and relations
Taxonomy-driven ontologies
Machine learning approach
K-means clustering
MPEG-7 based ontologies
MPEG-7 Components
COMM
DOLCE as Core Ontology
Digital Data Pattern
Decomposition Pattern
Content Annotation Pattern
Semantic Annotation Pattern
Conclusion
6.3 Visual Descriptor Ontology (VDO)
Other multimedia ontologies
M3O
Multimedia Metadata Model
EXIF
Patterns
Conclusion
Bibliography

2 Introduction

For the past three decades the World Wide Web has been growing in both size and complexity as more technologies, platforms and tools became available to users and developers, creating a very diverse and somewhat standardized cyber-environment made up of all sorts of information and data, stored in various formats and sizes across multiple machines scattered around the globe. Names like YouTube, Google, Flickr, Pinterest and many others have become milestones in the development of the modern Web. Such websites deal with a large amount of so-called multimedia content that is publicly accessible (shared) and enriched by a multitude of users from all around the world. This led to a substantial focus on increasing both connectivity and bandwidth in order to improve the efficiency and transfer rate of such a large quantity of information.

While the issue related to the physical infrastructure was easily resolved by the Internet Service Providers, the problems concerning search, retrieval and browsing of large and diverse amounts of multimedia content remain partially unresolved, and there is no officially standardized way to approach them. The primary concern of users is finding what they are looking for among an almost infinite ocean of multimedia objects. Ideally and intuitively, this would be done either by performing a query by example (comparing two multimedia objects for similarities) or by composing a textual query containing the desired concepts and semantics. In either case, an underlying logic must be implemented in order to bridge the so-called semantic gap between the physical structure of multimedia objects and the semantics of their content. In other words, systems must mimic human thinking and interpretation, and advances in this field have been steady and considerable.

3 Overview

In this chapter I introduce some of the most important concepts related to multimedia data and ontologies, with an emphasis on multimedia ontologies, describing their function, goals and most important requirements and characteristics. I also provide a general overview of the tools and languages available for authoring ontologies, many of which are tightly connected with the world of the Semantic Web, since this is one of the main domains of application as far as multimedia information is concerned.

3.1 Multimedia Data

Multimedia data is a diverse set of data types covering a wide variety of media formats (images, audio, video etc.). The large number and diverse nature of such formats require an equally diverse and complex set of tools to handle, manage and manipulate them. That is the very scope and objective of multimedia systems.

It is easy to search and retrieve multimedia objects by keywords and titles. For complex queries, however, search and retrieval is non-trivial and requires specifying the meaning of the resources described on the Web (or in other large repositories of information, although the Web is the largest and most diverse). This is what we call the semantics of the data. In this paper, I will focus on a specific subset of the multimedia domain, with an emphasis on multimedia ontologies and how to construct them.

In the world of multimedia data, one of the most important issues is the automatic extraction of indices describing the content of a document. This is a particularly difficult task since a document containing multimedia objects can be quite complex in nature, and the extraction of semantic information (high-level features) from the raw data (low-level features) can be challenging, although many advances in the field have been made in recent decades. The extraction process is seen as an important goal because users can then search, retrieve, store and manage multimedia data based on their high-level features (content) rather than raw attributes.

Problem-specific systems built to extract such features are often severely limited by their domain of application and usually target only a limited number of data formats. It would seem, therefore, that such an option is expensive, time-consuming and not really worth the trouble.

A better approach would be to construct multimedia ontologies with the aid of machine learning techniques. Multimedia ontologies can be very powerful and versatile tools that are partly guided by the data itself: the ontologies are enriched and improved using extracted semantic content, while the extraction process is, in turn, influenced by previously extracted information (such a process is often described as bootstrapping, meaning that it proceeds without direct input from a conscious agent, although this is not always the adopted approach).

The world of ontologies is a relatively new one and opinions on what standards and approaches to follow differ vastly. That is why there is no single right way to implement a multimedia ontology. Some of the available options are presented and analyzed in this paper.

3.2 Ontologies

"In computer science and information science, an ontology is a formal naming and definition of the types, properties, and interrelationships of the entities that really exist in a particular domain of discourse." - Wikipedia

An ontology can also be defined as a formal, explicit specification of a shared conceptualization. Formal refers to the fact that the ontology is both machine-readable and human-readable. Explicit means that concepts, constraints and relationships are explicitly defined (perhaps via a specific language or tool). Finally, shared refers to the fact that the knowledge within the ontology can be accessed, retrieved, searched and expanded by different sources and users, and that concepts can refer to other concepts in the same ontology to form relationships.

In the case of a multimedia ontology, concepts are linked to multimedia objects (images, videos, soundtracks etc.). Multimedia ontologies are used to search, retrieve and perform automated or semi-automated annotation of multimedia content.

3.2.1 Motivations

The primary motivations behind the implementation of an ontology are the following:

- Manual search and retrieval of multimedia content is neither efficient nor convenient (the amount of information is huge and it is difficult to find what the user is looking for).
- There is a large amount of multimedia data available, but much of it is inaccessible for various reasons (it resides in the Deep Web or is overlooked by search engines).
- With an ontology, content can be found and retrieved more efficiently.
- Shared knowledge base.
- Content reasoning (semantic annotation) and system interoperability. Information that is not explicitly defined can emerge as a result of content reasoning.

3.2.2 Characteristics

- Multimedia data is modelled in terms of low-level features and descriptors (machine-oriented and easy to extract automatically; MPEG-7 is suitable for this purpose).
- Low-level features are mapped to high-level features.
- Relationships between different multimedia objects are established by exploiting the underlying relationships between concepts in the ontology.

3.2.3 Requirements

- MPEG-7 compatible: it is one of the most widely used standards to describe multimedia content. Content annotated with MPEG-7 should be easy to integrate into our ontology.
- Semantic interoperability: annotations provided by the users are useful only if they can also be shared between different systems.
- Syntactic interoperability: a common language is required to communicate across multiple platforms and applications. For example, in the context of the Semantic Web, XML, RDF and OWL are viable options.
- Separation of concerns: a clear division between domain knowledge and the knowledge required for technical management of the data.
- Modularity: a desirable property to keep in mind when designing ontologies. Modularity means that the ontology can be easily maintained and expanded without compromising the entire structure.
- Scalability: adding new content should not be hard. Ontologies are only useful if they can adapt to a dynamic environment in constant evolution. Changes must not compromise the existing ontology.

Constructing an ontology is usually a manual or semi-automated iterative process consisting of three steps:

1. Selection of the concepts to be included in the ontology.
2. Establishment of the properties to include in the ontology and of their relationships.
3. Maintenance of the ontology itself.

Broadly speaking, there are two approaches one can choose from when deciding to build an ontology: concept-driven or data-driven.

In the concept-driven approach, an individual with specific expertise in a particular domain authors the ontology, using their own knowledge to populate it. In the data-driven approach, data is used to populate and construct the ontology in addition to knowledge about the domain of interest.

Obviously, the ideal goal would be to construct ontologies automatically with very little manual intervention. This is not yet an option, but recent research has been moving in that direction. It is particularly difficult to achieve because it requires further study and understanding of AI and machine learning techniques: choosing and understanding the relevant concepts and relationships within a domain is something that only a human-like intelligence is capable of doing to a satisfactory degree. For the time being, semi-automatic tools are usually employed to construct ontologies. Such tools use text mining techniques to select and discover concepts and the relationships between them.

3.3 Types of ontologies

Upper ontology

Also known as a top-level ontology or core ontology, it consists of the most general terms that are common across all domains of knowledge. Lower-level ontologies derive domain-specific concepts from the more general ones provided by the upper ontology, in a way reminiscent of the relations between classes and their subclasses in object-oriented programming. Examples of such ontologies are:

- WordNet (not technically an ontology, although often compared to one and widely used in learning domain ontologies)
- DOLCE
- Dublin Core

Essentially, they serve as a starting point for constructing new ontologies and for interconnecting different ontologies, such as several domain-specific ones. They contain only the most general principles and concepts, derived from mathematical truths and philosophical axioms. DOLCE was built following this approach: it contains mathematical, philosophical and psychological concepts, which makes it reusable in a wide variety of contexts and projects.

Domain ontology

Includes concepts that are part of a specific domain. As an ontology expands, it may be necessary to include several domain-specific ontologies. Merging different ontologies is a manual and time-consuming process, but research has been directed towards achieving fully automatic merging. Domain ontologies are typically built by the designer using some of the tools and languages described in section 3.5, such as OWL, Protégé, RDF etc.

Multimedia upper ontology

MPEG-7 belongs to this category. Such ontologies form the foundation for more specialized multimedia ontologies. It is important to note that, on its own, such a format is usually not suited for automatic extraction and semantic annotation, given that it only offers the bare essentials and does not provide any domain-specific concepts. It is used in conjunction with core and domain ontologies by integrating the low-level visual descriptors with the domain or core ontology concepts, in order to enable sophisticated techniques of content reasoning.

3.4 The Annotation Problem

Annotating images on a small scale can be a simple task. However, mass annotation of large and diverse quantities of multimedia content is complex and difficult.

First and foremost, annotation of content should ideally be done when the multimedia object itself is produced, since it is cheaper and simpler to do so than to postpone it to the post-production phase. For example, digital cameras use the EXIF format to add metadata to JPEG files (date and time of creation etc.). It is also more convenient for images to be annotated manually by users as they create or upload them. This is, however, time-consuming and often unfeasible.

Another issue is the difficulty of identifying the concepts and abstraction levels that are desirable and useful for the annotation process. What concepts are important? What relationships? It is easier to answer these questions when we have a clear objective in mind while designing our ontology. This requires a trade-off between a very versatile ontology that is difficult (and expensive) to design and a very narrow, goal-oriented ontology that is comparatively easy to construct but basically useless for any application other than the one it was initially designed for. Since both generic and specific concepts are used in almost all applications, modern multimedia ontologies often employ one or more domain ontologies together with a core ontology which provides the highest-level concepts.

Finally, one of the largest problems is the lack of semantic interoperability: metadata and annotations generated for and by many different tools and applications are usually not compatible with applications and systems other than the ones that generated them. [16]

3.5 Tools and languages

Ontologies may be constructed using several languages and tools. In this section I provide a brief overview of some of the most widespread options.

XML

The Extensible Markup Language (XML) is frequently used in web applications and to exchange data between applications, although JSON has emerged as a strong competitor in some of its uses. It is both human-readable and machine-readable and is defined by the W3C XML 1.0 Specification. It is predominantly meant to be used by web applications, but it has been employed in a wide range of systems and applications that are not web-based, to represent data structures and to serialize objects in OOP. Key concepts:

- Tag: present in every markup language, it is enclosed in angle brackets (< >). A start-tag looks like <tag> and the matching end-tag like </tag>.
- Element: a component delimited by a start-tag and an end-tag, for example <tag1>element1</tag1>. Elements can contain other elements, creating a nested structure of child elements.
- Attribute: a name-value pair inside a start-tag, for example <tag1 attr1="test" />. In this case, attr1 is the name while test is the value.
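As a rough illustration of the three concepts listed above, the following Python sketch parses a small XML document with the standard library module xml.etree.ElementTree. The document itself (the image, title and keyword tags and their attributes) is a made-up example and does not come from any standard discussed in this thesis.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tag and attribute names are invented for illustration.
DOC = """
<image id="img001" format="jpeg">
    <title>Sunset over the bay</title>
    <keywords>
        <keyword>sunset</keyword>
        <keyword>sea</keyword>
    </keywords>
</image>
"""


def describe(xml_text):
    """Return the root tag, its attributes and the text of every keyword element."""
    root = ET.fromstring(xml_text.strip())
    keywords = [k.text for k in root.iter("keyword")]
    return root.tag, dict(root.attrib), keywords


tag, attributes, keywords = describe(DOC)
print(tag)         # name of the root element
print(attributes)  # name-value pairs found in the root start-tag
print(keywords)    # text content of the nested child elements
```

The nested keyword elements show how child elements form a tree, while id and format are attributes of the root start-tag.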

Illustration 1: Snippet of XML code

RDF

The Resource Description Framework (RDF) is a metadata framework that provides a degree of semantic interoperability among applications that exchange machine-understandable metadata on the Web. It is similar to other conceptual modeling approaches (such as entity-relationship). The idea is to describe resources by using the triple (subject, predicate, object), where the subject is the resource, the predicate is a trait or aspect of the resource, and the object is the value of that trait, which may be another resource. In other words, triples represent facts in the ontology.

RDF has several serialization formats for easy transfer and communication between systems and applications. This makes it particularly useful in the world of the Semantic Web. RDF data is usually stored in relational databases or in triple stores. RDF can be serialized as XML, but other serialization formats are available.

The subject of an RDF statement is usually a URI (uniform resource identifier), which does not necessarily represent an actual, physical resource accessible at that URI. By convention, URIs without the # symbol usually denote actual resource locations, whereas the # symbol indicates a URI reference ending with a fragment identifier, used for unambiguous identification of resources via the URI scheme.

Finally, there are query languages for RDF, of which SPARQL is the most commonly used. Its syntax is similar to that of SQL and it is a W3C Recommendation.

OWL

The Web Ontology Language (OWL) is a family of knowledge representation languages for constructing ontologies on the Semantic Web. In other words, it is a language for defining and instantiating Web ontologies. OWL languages are built upon RDF. RDF and OWL are particularly well suited to authoring ontologies, since the basic elements of an ontology are concepts, concept properties and relationships between concepts, all of which can be easily modelled using either RDF or OWL. OWL formalizes a domain by defining classes, properties and relationships. [14] Its syntax is based on that of RDF/XML.

OWL can be divided into three specialized sublanguages [15]:

- OWL Lite: used for simple, light-weight ontologies where a slim hierarchy with few constraints is sufficient.
- OWL DL: based on description logic and used for constructing heavy-weight ontologies, where expressiveness is of primary importance. It retains computational completeness (all conclusions are guaranteed to be computable) and decidability (all computations finish in a finite amount of time).
- OWL Full: designed with RDF Schema compatibility in mind. It is in fact an extension (both in syntax and semantics) of RDFS, and it is used for ontologies that require maximum expressiveness together with RDF/RDFS compatibility. However, unlike OWL DL, it offers no computational guarantees.
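The triple model that RDF introduces, and on which OWL builds, can be sketched in a few lines of Python. This is only a toy in-memory illustration, not an RDF library: the http://example.org/ URIs, the predicate names and the match helper are all assumptions made up for the example.

```python
# Hypothetical namespace and predicates, used only for illustration.
EX = "http://example.org/"

# Each fact is a (subject, predicate, object) triple.
triples = {
    (EX + "photo1", EX + "depicts", EX + "Vesuvius"),
    (EX + "photo1", EX + "creator", EX + "alice"),
    (EX + "photo2", EX + "depicts", EX + "Vesuvius"),
    (EX + "Vesuvius", EX + "locatedIn", EX + "Naples"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


# Roughly what the SPARQL pattern "?photo ex:depicts ex:Vesuvius" asks for.
photos = [s for (s, _, _) in match(triples, predicate=EX + "depicts", obj=EX + "Vesuvius")]
print(photos)
```

A real triple store adds indexing, typed literals and a full query language, but the wildcard pattern above is the basic idea behind a SPARQL triple pattern.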

Like RDF, OWL also works with triples, but it adds semantics to the schema, such as symmetric relationships: A isMarriedTo B implies B isMarriedTo A. Another interesting feature is the concept of sameness. OWL allows two things to be declared the same, which is useful for combining data from multiple sources (for example, the same concept can come from two different websites, in which case OWL can define it as the same concept). OWL has a higher level of expressivity than RDF, and relations between classes can be modelled rigorously using description logics. It can also be serialized as RDF/XML and queried using either SPARQL (although this is not advised) or a DL query. The latter can be easily performed in the Protégé editor.

MPEG-7

[7] MPEG-7 is a multimedia content description standard. The metadata provided by this standard is associated with the multimedia object itself, allowing for fast search and retrieval of desired content. Unlike the similarly named MPEG-4, MPEG-2 etc., this standard does not specify how to encode audio or video but rather uses XML to store metadata. It differs from the previously mentioned options because, while MPEG-7 is specifically directed at multimedia objects, RDF and OWL are used to build ontologies that are domain-independent. The standard includes visual descriptors representing low-level features that can be automatically extracted. MPEG-7 is described in further detail in the chapter on MPEG-7 based ontologies.

Protégé

Another important tool for constructing ontologies is the editor. Ontology editors are applications designed to assist the designer in creating new ontologies and managing them.

They are usually built on top of one of the previously mentioned languages (namely RDF and OWL). One of the most prominent of such editors is Protégé, a free, open-source ontology editor and knowledge-base framework developed at Stanford University. [9] Ontologies built in Protégé can be exported as RDF, RDFS, OWL and XML Schema. It is based on Java and is extensible with plugins. It supports two main ways of modelling ontologies:

- The Protégé-Frames editor allows constructing and populating frame-based ontologies that are OKBC-compliant (Open Knowledge Base Connectivity is a protocol and API for accessing knowledge bases stored in knowledge representation systems). Following this approach, an ontology is structured as a class hierarchy that represents the domain's concepts, together with specific instances, properties of classes and their interrelationships.
- The Protégé-OWL editor allows developers to build ontologies for the Semantic Web using OWL.

Illustration 2: A screenshot of the IDE. A sample class hierarchy is shown on the left panel.

Protégé comes with an API which allows other applications to access the knowledge base. For example, it provides methods such as getKnowledgeBase(), which enables access to the content within the ontology. It also provides tools and functions to query and expand ontologies.

M-OntoMat-Annotizer

M-OntoMat-Annotizer is an image annotation tool developed within the aceMedia project. It provides a graphical interface and automated functions for loading and processing image content, segmentation, annotation extraction and segment selection. It essentially provides the ability to build a knowledge base which can be used to perform content analysis and reasoning. [17]

It supports several plugins, such as the Visual Annotation Framework (VAF) and the Visual Description Extraction Tool (VDE). VAF supports automatic segmentation and annotation of both segments and whole images, by assigning domain ontology concepts to them. VDE manages the extraction of low-level features: it facilitates the loading and processing of visual content (such as images and videos), the extraction of visual features and their linking with domain ontology concepts. VDE allows the user to draw a selection region within the image and extract multimedia descriptors only from the selected region. By using an ontology browser, the user can select a domain concept and extract from the selected region of the image the visual descriptors that match the chosen concept, thereby bridging the semantic gap. VDE saves the associations between domain ontology concepts and visual descriptors in RDFS files. [18]


3.6 Multimedia ontologies

Multimedia ontologies are usually constructed by employing a diverse set of tools and approaches. Ordinary ontologies are not by themselves sufficient to deal with the problem of search and retrieval of multimedia information. This is clear if we think of the diverse and complex nature of multimedia formats such as video and audio. How does one look for a video or audio track containing the desired content? Neither the problem nor its solution is trivial.

Multimedia ontologies allow both users and applications to access and process descriptions in a shared knowledge base. They are basically analogous to domain-specific ontologies that focus on multimedia objects and on the low-level descriptors which describe the structure of the object itself. Such low-level features are machine-oriented and easily extracted from multimedia documents. The high-level semantic features are instead the focus of domain-specific ontologies and usually involve a combination of manual and semi-automatic annotation.

As previously mentioned, most approaches need to be compatible with MPEG-7 or similar standards, since the purpose of a multimedia ontology is to map low-level descriptors to high-level features, and MPEG-7 provides the low-level visual descriptors. However, before delving deep into the world of ontologies and the several options available for constructing them, it is worth providing an overview of the paths an ontology designer can follow in order to achieve the goal.

Ontologies can essentially be categorized based on their indexing approach:

- Text-based: annotations and other metadata are the foundations of indexing based on textual information. Metadata provides the necessary information about a multimedia object (for example, a picture), and thus the user can easily search and retrieve content by querying the ontology (for example, by using filters, searching text etc.). These ontologies are relatively easy to construct, but they can have consistency problems since annotations are manual and subject to ambiguity and inaccuracy.
- Content-based: the physical, low-level features of multimedia objects are extracted (such as color, shape and texture). This approach requires sophisticated techniques and computer vision software.

The wisest and often optimal choice is a combination of the two approaches. Indeed, when constructing an ontology, annotations and visual features often both contribute to the reliability and consistency of the search and retrieval of multimedia documents.

Another, somewhat related, classification divides ontologies into heavy-weight and light-weight:

- Heavy-weight: fully described and fleshed-out ontologies. They define several concepts and the interrelationships between them, and they make use of a large number of axioms to model knowledge. They are capable of semi-automatic annotation and extraction of semantics from multimedia content. Ontologies in this category are usually constructed by using MPEG-7 and M-OntoMat-Annotizer. They make use of both high-level domain concepts and low-level visual descriptors.
- Light-weight: partially described ontologies, which include taxonomies and semantic hierarchies. Such hierarchical conceptual structures provide the backbone for image annotation by exploiting the mapping of concepts to common visual features and the semantic interrelationships between the concepts themselves. This can be done by combining WordNet with visual appearance learning approaches based, for example, on clustering.

ImageNet is a large visual database, built on top of WordNet, for use in visual object recognition software. However, since hierarchies ignore visual features, they are not sufficient on their own for constructing a reliable multimedia ontology; other techniques are used in conjunction with them to create visual vocabularies.

4 Foundational ontologies

Also known as upper ontologies, core ontologies or top-level ontologies, they consist of general concepts and relations that can be commonly found across multiple domains of knowledge. They function as a starting point for constructing more specialized ontologies by specializing and deriving concepts that are more closely related to the domain of interest. When it comes to instantiating multimedia ontologies, DOLCE is usually the preferred choice, and it is in fact employed by several multimedia projects such as COMM and M3O.

4.1 DOLCE

[4] The Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) is one of the most commonly used core ontologies. As a foundational ontology, it only includes and defines general, reusable and versatile upper-level categories and concepts. It draws inspiration from several fields such as mathematics, philosophy and psychology, and it was designed to be reusable and modular. It can be extended using several models to cover different domains, and it is also meant to be used in conjunction with other ontologies, especially domain-specific ones. It is also the first module of the WonderWeb Foundational Ontologies Library (WFOL), which is a modular collection of core ontologies. WFOL is a good starting point for building new ontologies, since it is easier to model specific domains by starting from the most general concepts.

DOLCE is cognitively biased towards human-like perceptions and abstractions, which makes sense if we think of its usage within the context of the Semantic Web. Search, retrieval, relationships and associations are all about what the user wants, thinks and expects. It makes sense, therefore, that a foundational ontology such as DOLCE reflects this in its design (the keywords concerning the DOLCE design philosophy are included in the acronym itself). In other words, DOLCE concepts, abstractions and relationships are heavily influenced by human cognition, language and perception.

4.1.1 Basic categories

The basic taxonomy is presented in the image above. In this section I introduce some of the more important categories.

- Endurants: entities that can be perceived as complete objects independently of time. At any given moment we can still recognize the endurant as an entity in its entirety. For example, material objects are endurants, as are abstract monolithic concepts such as organizations. They can change in time. The important property to consider here is that endurants can survive as entities even when they lose some of their parts.
- Perdurants: entities for which only a part exists at any given time. They are also known as events, occurrents or processes. Endurants and perdurants are related to each other by participation: an endurant (object, person, subject) lives by participating in a perdurant (process, event etc.). Unlike endurants, perdurants cannot lose parts: a perdurant extends in time through its temporal parts, and those parts are fixed.
- Qualities: also known as properties. They do not have autonomous existence but rather require the existence of an entity. Color, temperature and height are all properties of entities. They can also be seen as basic entities that are perceived or measured.

4.1.2 Qualities and quality regions

Qualities are entities that can be perceived and measured, such as physical traits or features. Qualities depend for their existence on the entity they are attached to. They belong to a finite set of quality types and are unique to specific entities. Each quality has a value (for example, in the case of color it can be an RGB value). The value of a quality is also called a quale, which represents the position of the quality within its quality space (which is basically a set of values).

The difference between a quality and a quale can be easily explained with an example. In "the white note turns yellow", we are talking about an entity (the note) whose color quality changes in time. "White is brighter than yellow" is instead a comparison between two qualia in the quality space.

Each quality type has an associated quality space with a specific structure (e.g. a metric linear space for length and height). Space, time, color, height, weight etc. are all quality types. If the quality type is either space or time, the quale is called a region (which can be a temporal or spatial region). So, the location in space of an object is a quality of the type space (in a sense, a quality is an instance of a quality type) and its quale is a region in geometric space.
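The distinction between a quality and its quale can be made concrete with a small Python sketch. It is only an informal reading of the example above, not part of DOLCE: the class names, the RGB encoding of the color space and the brightness measure (a plain average of the channels) are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quale:
    """A position in the RGB color quality space (hypothetical encoding)."""
    r: int
    g: int
    b: int

    def brightness(self):
        # Plain average of the three channels; a deliberately crude measure.
        return (self.r + self.g + self.b) / 3


@dataclass
class ColorQuality:
    """A color quality that belongs to exactly one entity (its bearer)."""
    bearer: str
    quale: Quale


WHITE = Quale(255, 255, 255)
YELLOW = Quale(255, 255, 0)

# "The white note turns yellow": the same quality takes a different quale over time.
note_color = ColorQuality(bearer="note", quale=WHITE)
note_color.quale = YELLOW

# "White is brighter than yellow": a comparison between two qualia.
print(WHITE.brightness() > YELLOW.brightness())
```

The quality object stays attached to the note throughout, while the qualia are immutable points of the quality space that can be compared without reference to any entity.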

4.1.3 Functions and relations

The argument restrictions below use the DOLCE category predicates: AB (abstract), ED (endurant), PD (perdurant), Q (quality), T (time interval), TR (temporal region), PR (physical region), AR (abstract region), TQ (temporal quality), PQ (physical quality) and AQ (abstract quality).

- Parthood, "x is part of y": $P(x, y) \rightarrow (AB(x) \lor PD(x)) \land (AB(y) \lor PD(y))$
  This relation applies to abstract and perdurant entities.
- Temporary parthood, "x is part of y during t": $P(x, y, t) \rightarrow (ED(x) \land ED(y) \land T(t))$
  This relation applies to endurant entities, since it is necessary to know when a specific parthood holds.
- Constitution, "x constitutes y during t": $K(x, y, t) \rightarrow ((ED(x) \lor PD(x)) \land (ED(y) \lor PD(y)) \land T(t))$
- Participation, "x participates in y during t": $PC(x, y, t) \rightarrow (ED(x) \land PD(y) \land T(t))$
- Quality, "x is a quality of y": $qt(x, y) \rightarrow (Q(x) \land (Q(y) \lor ED(y) \lor PD(y)))$
- Quale, "x is the quale of y (during t)": $ql(x, y) \rightarrow (TR(x) \land TQ(y))$ and $ql(x, y, t) \rightarrow ((PR(x) \lor AR(x)) \land (PQ(y) \lor AQ(y)) \land T(t))$

5 Taxonomy-driven ontologies

One sensible approach to designing an ontology is to use a semantic-lexical database (e.g. WordNet) to generate the words with which multimedia objects are annotated. [12] This can be done by employing a hierarchical structure (which constitutes what is known as the taxonomy), which allows for a hierarchical classification of the objects themselves. This is a very effective step in constructing multimedia ontologies if we think of the system from the perspective of the end user: it is much simpler and more convenient to request information and multimedia content by using something that is very close to the natural language of the user. The early approaches were based on low-level features such as color, texture etc. A good analogy would be a modern DBMS and SQL.

30 The latter allows the user to compose queries for data-retrieval with a syntax and morphology meant to emulate a human language. Queries based on data size or other lowlevel attributes would not be very convenient nor effective at retrieving a desired piece of information from a database. The annotation is usually done by annotating images as a whole or specific subregions of the image. It is then possible to infere new meaning and related concepts from an annotated region based on the hierarchical structure of the underlying taxonomy. This has some notable advantages when it comes to filling the semantic gap between high-level features and low-level features. For instance, let's say that from a training set of known images we've derived the concepts of car, truck, van, motorcycle etc. Such concepts were properly labelled by the words provided by the semantic database. We also know that each of those words are hierarchically organized and are subordinate to the word vehicle, which of course repesents the category to which all those other concepts belong to. We can use such hierarchies for identifying visual similarities among different vehicles. In short, an ontology requires an organization of concepts of interest and the relationships among them. Using WordNet for automatic annotation of images increases the number of concepts that a system can recognize. Another explicative example: suppose we have several concepts such as dog, cat, cow, eagle etc. All of those are classified under the concept node animal. It is possible to annotate an image with such a word even though there are no images with such annotation. Regions of images corresponding to specific animals (dog, cat etc) can be used to learn about the more general concept of animal and subsequently using such concept in future annotations and retrieval queries. 30
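The vehicle and animal examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HYPERNYM` table is a tiny hand-written stand-in for a WordNet-like hierarchy (it is not read from WordNet), and the function names and image ids are hypothetical.

```python
# Hypothetical, hand-written fragment of a WordNet-like taxonomy (child -> parent).
HYPERNYM = {
    "car": "vehicle",
    "truck": "vehicle",
    "van": "vehicle",
    "motorcycle": "vehicle",
    "dog": "animal",
    "cat": "animal",
    "cow": "animal",
    "eagle": "animal",
}


def ancestors(word):
    """Return the chain of more general concepts above `word`, nearest first."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def expand_annotations(labels):
    """Add every ancestor concept to a set of annotation words."""
    expanded = set(labels)
    for label in labels:
        expanded.update(ancestors(label))
    return expanded


def retrieve(images, query):
    """Return the ids of images whose expanded annotations contain `query`."""
    return sorted(
        image_id
        for image_id, labels in images.items()
        if query in expand_annotations(labels)
    )


# Invented annotations: no image is labelled "animal" or "vehicle" directly.
images = {"img1": {"dog"}, "img2": {"truck", "car"}, "img3": {"eagle"}}
animal_images = retrieve(images, "animal")
```

A query for the general word animal returns the images annotated with dog and eagle, although neither was ever labelled animal. This is the gain in recognizable concepts that the hierarchy provides.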

5.1 Machine learning approach

It is possible to use machine learning and computer vision approaches to train an algorithm to identify concepts and automatically annotate images. This is typically done by using a training set of images in order to learn a correlation between co-occurrences of certain words and images or, more often, image regions.

Regions within an image can be identified with different methods, some more sophisticated than others. It is possible to use a combination of filters and edge detection algorithms to try to identify shapes and semantics within an image. However, the simplest approach is to divide an image into a plain grid of tiles of equal size.

As mentioned earlier, concepts can be defined in text by using a knowledge base such as WordNet. The hierarchy underlying the annotation words provided by the taxonomy can be used in the translation model for image annotation. In such an approach, images are divided into subregions. Vectors containing the low-level features of each region are clustered to generate a visual vocabulary. Each element of the vocabulary is known as a blob. Each region of an image is mapped to a blob. Blobs can be translated into words that can be traced within the WordNet taxonomy and, from there, related meanings and concepts can be found.
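The grid-based segmentation mentioned above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the image is assumed to be a plain list of rows of (r, g, b) tuples whose dimensions divide evenly by the grid, the mean color is chosen here as the only low-level feature, and all function names are hypothetical.

```python
def split_into_tiles(image, rows, cols):
    """Split an image (list of rows of (r, g, b) pixels) into rows x cols equal tiles."""
    height, width = len(image), len(image[0])
    tile_h, tile_w = height // rows, width // cols
    tiles = []
    for i in range(rows):
        for j in range(cols):
            tile = [
                row[j * tile_w:(j + 1) * tile_w]
                for row in image[i * tile_h:(i + 1) * tile_h]
            ]
            tiles.append(tile)
    return tiles


def mean_color(tile):
    """Low-level feature vector of a tile: its average (r, g, b) color."""
    pixels = [pixel for row in tile for pixel in row]
    count = len(pixels)
    return tuple(sum(pixel[c] for pixel in pixels) / count for c in range(3))


# Toy 4x4 image: left half red, right half blue.
RED, BLUE = (255, 0, 0), (0, 0, 255)
image = [[RED, RED, BLUE, BLUE] for _ in range(4)]

# One feature vector per tile, in row-major order.
features = [mean_color(tile) for tile in split_into_tiles(image, 2, 2)]
```

The resulting feature vectors are what a clustering step would group into the visual vocabulary of blobs.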

5.1.1 K-means clustering

K-means is a popular algorithm used in vector quantization. Generally speaking, quantization algorithms divide a large set of vectors into groups (or clusters). Each cluster is represented by a centroid point. In the specific case of K-means, n vectors are partitioned into k clusters and each vector belongs to the cluster with the nearest mean. Heuristic algorithms are used since the problem is NP-hard.[13]

In conjunction with k-means clustering, the loosely related k-nearest neighbour algorithm (with k=1) can be applied to the cluster centers obtained by k-means in order to classify new data into the existing clusters (this mixed approach is also known as the nearest centroid classifier or Rocchio algorithm).

The quality of the clusters obtained through this algorithm depends on the choice of the initial cluster centers. As one might guess, choosing the centers at random does not generally give the best results. Instead, we can use the hierarchy defined by the text ontology to select the centers. We can now truly appreciate the advantage of using a text ontology. Since words belong to a hierarchy which includes other words of possibly related meaning, the regions can be grouped under each node of the hierarchy. In other words, given a region r of the image I, r is placed under concept w if one of the following conditions is met:

1. w is an annotation word for image I.
2. one of w's child nodes is a valid annotation for image I.

This means that, if condition 2 is met, image regions placed under a concept may not contain the elements suggested by the word assigned to them. (For example, take a beach scene with a clear sky and the sun. The region containing the sun may be placed under the concept beach even though the beach itself is not contained within that particular region.) Nevertheless, averaging the feature vectors of the regions in this set produces a good semantic candidate for a cluster center.

Illustration 3: Wikipedia image representing a K-means clustering algorithm.
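The idea of seeding k-means with concept-based centers, followed by nearest-centroid classification, can be sketched as below. This is an illustration under assumptions, not the implementation used in the cited work: the two-dimensional feature vectors are invented, the regions are assumed to be already grouped by concept according to the two conditions above, and the refinement loop is the standard Lloyd heuristic.

```python
def mean(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return tuple(sum(v[i] for v in vectors) / count for i in range(len(vectors[0])))


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centers):
    """Index of the center closest to `vector` (1-nearest neighbour on the centers)."""
    return min(range(len(centers)), key=lambda k: squared_distance(vector, centers[k]))


def seeded_centers(regions_by_concept):
    """One initial center per concept: the mean feature vector of its regions."""
    concepts = sorted(regions_by_concept)
    return concepts, [mean(regions_by_concept[c]) for c in concepts]


def kmeans(vectors, centers, iterations=10):
    """Lloyd's heuristic: assign vectors to the nearest center, then recompute the means."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in vectors:
            clusters[nearest(v, centers)].append(v)
        # Keep the previous center when a cluster ends up empty.
        centers = [mean(c) if c else old for c, old in zip(clusters, centers)]
    return centers


# Invented feature vectors of regions already placed under two concepts.
regions_by_concept = {
    "sky": [(0.1, 0.2), (0.2, 0.1)],
    "beach": [(0.9, 0.8), (0.8, 0.9)],
}
concepts, initial = seeded_centers(regions_by_concept)
all_regions = [v for group in regions_by_concept.values() for v in group]
centers = kmeans(all_regions, initial)

# Classify a new region by its nearest cluster center.
label = concepts[nearest((0.7, 0.95), centers)]
```

Because each initial center is the average of regions placed under one concept, every resulting cluster keeps a word attached to it, which is what allows a new region to be translated into a concept.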

6 MPEG-7 based ontologies

These ontologies are often constructed with MPEG-7 in conjunction with a domain-specific ontology. M-OntoMat-Annotizer is an important piece of software that allows low-level visual descriptors to be mapped to high-level semantic features, and it is often used for authoring ontologies that employ the MPEG-7 standard. This section highlights some of the most widely used approaches for constructing multimedia ontologies; it may be argued that they represent the de facto standard in the field.

6.1 MPEG-7

[6] Standardized in ISO/IEC 15938, MPEG-7 is a multimedia content description standard formally known as Multimedia Content Description Interface. It provides a set of Description Tools for describing multimedia content (audio and video). This differs from previous standards such as MPEG-1 and MPEG-2, which are aimed at compressing multimedia data. The goal of MPEG-7 is to describe multimedia content with XML independently of the structure of the object itself. Just like XML, on which it relies, MPEG-7 does not provide standardized methods for extracting or searching metadata, but rather provides a standardized description of multimedia data. Descriptions can be compressed into a binary format (Binary Format for MPEG-7, BiM).

6.1.1 Components

- Multimedia Description Schemes (MDS):
  - Descriptors (D): define the syntax and semantics of each feature (metadata).
  - Description Schemes (DS): define the structure and semantics of the relationships between different metadata (D or DS).
- Data Description Language (DDL): defines the syntax of the Description Tools.
- System Tool: provides the methods for representing contents in binary format (useful for more efficient storage and transmission, for example over a network). It also defines the transmission mechanisms, the synchronization of descriptions with respect to contents etc.

MPEG-7 essentially provides two main functionalities: the decomposition of media assets and the annotation of the decomposed parts (called segments). A segment is the elementary abstract concept and can refer to a part of a multimedia object, such as a region of an image, a scene from a video or a textual fragment. MPEG-7 provides low-level descriptors for structural (spatial and temporal) properties of multimedia data as well as descriptors related to the annotation of segments. Such descriptors can be either low-level visual/audio features or more abstract concepts.

Illustration 4: Multimedia content is an abstract concept in MPEG-7. It is further specialized into concrete multimedia content types as shown in this image.

6.2 COMM

[3][5] The Core Ontology for MultiMedia (COMM) is implemented in OWL DL. It was conceived to facilitate the annotation of multimedia objects while remaining MPEG-7 compatible, and it is considered a standard for multimedia annotation. COMM is also compliant with all the other requirements introduced in section 3.2. It is used with DOLCE as a core ontology and is designed using two ontology design patterns (ODP):

- Descriptions and Situations (D&S)
- Ontology of Information Objects (OIO)

COMM provides a Java API with an MPEG-7 class interface which allows developers to create applications that construct metadata at runtime.[11]

One of the problems of using MPEG-7 alone is that it is most effective at automatically extracting low-level features, whereas annotations and semantics are described by XML fragments and can therefore vary between applications and sources. Every application will use different tags, attributes and entities to describe the semantics of a given multimedia object, which does not guarantee semantic interoperability. The example provided below illustrates the point:

Let's assume that the image above was processed by two different web applications in order to identify two regions and extract their semantics. An MPEG-7 snippet was used to identify two regions, REG1 and REG2, containing a man and his bike respectively. We are interested in loading this image into a multimedia system of our own design. Consider the XML code below:

    <Mpeg7>
      <Description xsi:type="ContentEntityType">
        <MultimediaContent xsi:type="ImageType">
          <Image id="IMG1">
            <SpatialDecomposition>
              <StillRegion id="REG1">
                <Semantic>
                  <Name>Bike</Name>
                </Semantic>
              </StillRegion>
              <StillRegion id="REG2">
                <TextAnnotation>
                  <Keyword>Bike</Keyword>
                </TextAnnotation>
              </StillRegion>
            </SpatialDecomposition>
          </Image>
        </MultimediaContent>
      </Description>
    </Mpeg7>

The two annotation elements, <Semantic> in REG1 and <TextAnnotation> in REG2, represent the interoperability issue with MPEG-7 annotation. The two descriptors are the result of two different extraction processes performed by two different web services. While both are semantically and syntactically correct, they are not treated the same by our multimedia system, since the latter cannot be designed to anticipate every possible syntactic variation generated by every external service in existence. This makes MPEG-7 poorly suited to Semantic Web technologies, since the standard itself does not provide a formal description of the semantics encapsulated in a multimedia object.
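The effect can be reproduced with a short Python sketch that uses the standard library XML parser. It assumes a hypothetical consumer that only understands the Semantic/Name convention. The `xsi:type` attributes of the snippet are left out here so that the fragment parses without a namespace declaration.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the snippet above (xsi:type attributes omitted).
MPEG7_SNIPPET = """
<Mpeg7>
  <Description>
    <MultimediaContent>
      <Image id="IMG1">
        <SpatialDecomposition>
          <StillRegion id="REG1">
            <Semantic><Name>Bike</Name></Semantic>
          </StillRegion>
          <StillRegion id="REG2">
            <TextAnnotation><Keyword>Bike</Keyword></TextAnnotation>
          </StillRegion>
        </SpatialDecomposition>
      </Image>
    </MultimediaContent>
  </Description>
</Mpeg7>
"""


def labels_from_semantic(root):
    """Hypothetical consumer that only looks for <Semantic>/<Name> inside each region."""
    found = {}
    for region in root.iter("StillRegion"):
        name = region.find("Semantic/Name")
        if name is not None:
            found[region.get("id")] = name.text
    return found


root = ET.fromstring(MPEG7_SNIPPET)
labels = labels_from_semantic(root)
```

Only REG1 ends up labelled. The annotation of REG2 is valid but is silently ignored, because it is expressed with a different element structure.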

6.2.1 DOLCE as Core Ontology

It is necessary to find a way to express the semantics contained in MPEG-7 annotations in order to guarantee semantic interoperability. Instead of using the XML Schema, a core ontology can be used as the foundation for COMM. By doing so, a domain-independent vocabulary can be created that formally defines the basic conceptual categories. The chosen top-level ontology is DOLCE.

As mentioned in the previous paragraph, the two patterns used to design COMM are D&S and OIO, provided by the DOLCE core ontology. These need to be extended in order to represent MPEG-7 concepts, since they are not sufficiently specialized for multimedia annotation. MPEG-7 provides decomposition and annotation (descriptors to annotate each segment of an object). D&S and OIO translate MPEG-7 technical concepts into the DOLCE vocabulary. The patterns are centered around Endurants and Perdurants.

Following the OIO pattern, we define multimedia-data as the abstract concept which can be specialized into concrete multimedia content types (such as image-data). Multimedia-data is realized by some physical media (such as an image). This concept is used to annotate the physical realization of multimedia content.

6.2.2 Digital Data Pattern

Both the multimedia content and the annotations are digital data. Digital-data entities express descriptions, in particular structured-data-descriptions, which define meaningful labels for the information within digital-data. The nature of the information can vary, and it is usually made of scalars, vectors, strings etc. Within the realm of DOLCE, these entities are mapped to abstract-regions. This pattern is basically used to formalize MPEG-7 low-level descriptors.

6.2.3 Decomposition Pattern

According to the D&S pattern, the decomposition of a multimedia-data entity is a situation (segment-decomposition) that is explained by a description, such as a segmentation-algorithm (automatic) or a method (manual), which has been applied to decompose the multimedia-data. Both segmentation-algorithm and method define roles. Of notable importance is the root-segment-role concept, since multimedia-data can be seen as a tree of segments, each of which can be annotated separately. When a method creates new multimedia content, a new segment with root-segment-role is created, which represents the whole object until further segmentation is performed.
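The tree of segments described above can be pictured with a small data structure. This is only a sketch of the idea, not COMM's Java API or its OWL vocabulary: the class, its field names and the method label "manual-method" are hypothetical, and only the notion of a root segment is taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical stand-in for a multimedia-data segment."""
    name: str
    plays_root_segment_role: bool = False
    decomposed_by: str = ""  # algorithm or manual method that produced the parts
    annotations: list = field(default_factory=list)
    parts: list = field(default_factory=list)

    def decompose(self, method, names):
        """Record one decomposition of this segment into named sub-segments."""
        new_parts = [Segment(name) for name in names]
        self.decomposed_by = method
        self.parts.extend(new_parts)
        return new_parts


def walk(segment, depth=0):
    """Yield (depth, segment) pairs for the whole segment tree, parents first."""
    yield depth, segment
    for part in segment.parts:
        yield from walk(part, depth + 1)


# The whole image is the root segment until it is decomposed further.
image = Segment("IMG1", plays_root_segment_role=True)
reg1, reg2 = image.decompose("manual-method", ["REG1", "REG2"])

# Each segment can be annotated separately.
reg2.annotations.append("Bike")
outline = [(depth, segment.name) for depth, segment in walk(image)]
```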

The meaning of digital data depends on the context. Digital data entities are connected through computational situations, such as the input and output data of an algorithm. Algorithms are descriptions, and annotations are situations that satisfy the rules of an algorithm or method. DOLCE makes it possible to define a clear taxonomy which enhances MPEG-7 capabilities, allowing for a more extensible multimedia ontology.

6.2.4 Content Annotation Pattern

This pattern formalizes the attachment of metadata, such as annotations, to multimedia-data. Using the D&S pattern, annotations also become situations which represent the state of affairs of the related digital-data (the metadata and the annotated multimedia-data). Digital-data entities represent the attached metadata by playing an annotation-role. The roles are defined by methods (manual annotation) or algorithms (automatic annotation). The multimedia-data entity being annotated plays an annotated-data-role. The metadata attached to a digital-data entity depends on the structured-data-description, which is formalized by using the Digital Data Pattern. Formalizing an annotation requires a specialization of Annotation, which creates a connection between DOLCE's concepts and MPEG-7 descriptors.

6.2.5 Semantic Annotation Pattern

MPEG-7 is not suitable for performing full semantic annotation. A domain-specific ontology is much better for this task, since it can provide better interpretations of multimedia content. This pattern focuses on connecting multimedia descriptions with domain descriptions provided by other ontologies. A DOLCE particular that represents the semantics of a multimedia object is connected to the multimedia data through the way the annotation was obtained, namely an algorithm or a method. The semantic-annotation entity is a description that satisfies the applied method and specifies that the annotated multimedia-data plays an annotated-data-role while the particular plays an annotated-label-role.

6.2.6 Conclusion

COMM is MPEG-7 compliant, since all the adopted patterns lead to mapping MPEG-7 descriptors to DOLCE concepts. When using OWL, string representation formats are used to serialize data type concepts not covered by W3C standards (for example, shapes). The system is also extensible, since we can always add more roles and parameters. Finally, COMM ensures semantic interoperability through the Semantic Annotation Pattern.

6.3 Visual Descriptor Ontology (VDO)

[1][2] VDO deals with semantic multimedia content. It was developed within the aceMedia project and used in the automatic semantic analysis process. The goal of aceMedia was to find ways to improve online market accessibility by generating semantic-based, user-aware content. ACE stands for Autonomous Content Entity and was meant to help users create, share, find and re-use multimedia content. It was an international effort led by several European companies and institutions.

VDO revolves around the MPEG-7 format, which provides the visual descriptors and the low-level, machine-oriented features used for multimedia reasoning. As mentioned earlier in this paper, MPEG-7 uses XML to integrate metadata with multimedia content. This, however, leads to incompatibility with Semantic Web applications and to issues regarding semantic interoperability, since different domains are not generally capable of interpreting MPEG-7 descriptors without ambiguity. That is why a core ontology and, eventually, a domain ontology must be used in order to perform content reasoning. VDO contains visual descriptors and models the concepts and properties that describe the visual features of multimedia objects (shape, color etc.).

[11] In order to create the ontology, it is necessary to introduce some concepts that are not natively present in the MPEG-7 XML Schema. These are: Region, Feature, VisualDescriptor and Metaconcepts. Region is a concept that identifies a portion of an image, and a VisualDescriptor can be attached to each Region. Feature, on the other hand, is made up of four concepts: Color, Shape, Motion and Texture. These are connected to subclasses of VisualDescriptor, which is the top concept based on the visual descriptors in MPEG-7.

A system based on this ontology can be created to automatically analyze images and extract a visual descriptor from the regions within an image. Once the visual descriptors are extracted from an image, they need to be mapped to the VD ontology in order to infer their semantics. In other words, we are interested in mapping visual descriptors to high-level semantic concepts from the ontology (content reasoning). A combination of semantic reasoning and comparison of low-level features can be used to improve the outcome of search and retrieval operations. Currently, VDO has been aligned to the DOLCE ontology, which has become a very popular core ontology.

7 Other multimedia ontologies

In this chapter I introduce some commonly used ontologies that do not fit within any of the previous categories and are based on different standards and formats.

7.1 M3O

[5][8][11] M3O (Multimedia Metadata Ontology) is widely used in the context of the Semantic Web and Web technologies, where the focus is usually on several types and formats of multimedia content (audio, video, images etc.) and it may be important to establish semantic relations between the content of a video and that of an image. Many solutions focus on a single media type only, which makes them poorly suited to the diverse nature of multimedia content on the Web, where more often than not several different applications are meant to work together or concurrently. M3O is ideal for annotating rich, diverse and structured multimedia content on the Web and for extracting its semantics, making it both human-readable and machine-readable. This ontology can be easily integrated with presentation formats such as SMIL, SVG and Flash (however, Flash seems to have declined in popularity in favor of newer emerging standards). Just like COMM, introduced in 6.2, M3O uses DOLCE as its core ontology and supports both low-level features and high-level semantic annotation.

7.1.1 Multimedia Metadata Model

M3O is used for rich presentations on the Web using SMIL, SVG and Flash. It provides patterns meant to satisfy certain requirements, which are themselves derived from the need to handle rich multimedia presentations. Let's take a SMIL presentation into consideration to illustrate exactly what is required for annotating structured multimedia content. Structured means that the content is made up of several discrete elements, such as multiple images, text and possibly input from the user (as is the case with a presentation). It is possible but difficult to annotate structured content using MPEG-7, which is why M3O is better suited than COMM for this kind of scenario.[8] There are three main requirements for annotating structured multimedia content:

- Separation of information objects and information realizations: multimedia content is a message to the user (much like ordinary text is). We can define this message as an information object (abstract) which can be realized by different information realizations (concrete). In other words, the same message can be realized in different ways. It can be a SMIL presentation, an AVI video, an MP4 video, a PowerPoint presentation etc.
- Annotation of information objects and information realizations: needless to say, the model must support the annotation of multimedia content. In the context of M3O this is often done with EXIF in the form of key-value pairs (a dictionary).
- Decomposition of information objects and information realizations: it is necessary to decompose multimedia objects into their constituents. In the case of a SMIL presentation this is done by decomposing it into images.

7.1.2 EXIF

EXIF stands for Exchangeable image file format and specifies the formats for images and sound used and produced by several types of devices such as scanners, cameras and smartphones. It also introduces metadata tags which add date and time information, thumbnail information and descriptions to files produced by digital cameras. EXIF metadata are restricted in size to 64 kB in JPEG images, since JPEG stores metadata in application segments as defined by the JPEG standard. EXIF is not as complex or as versatile as MPEG-7.
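The three requirements can be pictured together in a small sketch. It is not M3O itself, which is expressed with Semantic Web technologies. The classes and field names are hypothetical, and the metadata values are invented. `DateTimeOriginal` and `Make` are used as examples of EXIF-style keys.

```python
from dataclasses import dataclass, field


@dataclass
class InformationRealization:
    """Concrete form of a message, e.g. a SMIL file or an MP4 video."""
    media_format: str
    annotations: dict = field(default_factory=dict)  # key-value metadata
    parts: list = field(default_factory=list)        # decomposition into constituents


@dataclass
class InformationObject:
    """Abstract message, independent of how it is realized."""
    title: str
    annotations: dict = field(default_factory=dict)
    realizations: list = field(default_factory=list)


# Annotation: an image carrying EXIF-style key-value pairs (invented values).
photo = InformationRealization(
    "JPEG",
    annotations={"DateTimeOriginal": "2018:05:01 10:30:00", "Make": "ExampleCam"},
)

# Decomposition: the SMIL presentation is made up of its images.
slides = InformationRealization("SMIL", parts=[photo])
video = InformationRealization("MP4")

# Separation: one abstract message, two concrete realizations.
talk = InformationObject(
    "Holiday report",
    annotations={"topic": "travel"},
    realizations=[slides, video],
)
formats = [realization.media_format for realization in talk.realizations]
```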

7.1.3 Patterns

M3O is implemented with Semantic Web technologies, which allows both high-level and low-level features to be represented. This is also advantageous because SMIL and SVG natively define the use of RDF for modelling annotations. Furthermore, M3O is created following a pattern-oriented ontology design approach.

- Annotation pattern: attaches a note to an entity which describes it (for example, tags on an image). It is the attachment of metadata to an information entity within the context of a computer system.
- Decomposition pattern: models the decomposition of a structured multimedia object (e.g. a SMIL presentation) into its logical parts, or into segments if it is an image. This pattern decomposes the composite into components. It is possible to annotate each component individually.

DOLCE+DnS Ultralight patterns:

- Descriptions and Situations Pattern: a pattern that makes it possible to establish relations between entities within a specific context. Roles and types are also only valid within a given context. The basic concepts are Description, Concept, Entity and Situation. A Situation satisfies a Description, which in turn defines the roles and types in a Concept (which represents a context). Each Concept classifies an Entity. Finally, an Entity is related to a Situation. Concepts can also be related to other concepts.
- Information Realization Pattern: models the distinction between information objects and information realizations. In the case of a SMIL presentation, the presentation itself is the realization while the content (semantics, meaning) is the information object.

Ontologies help us define concepts with which we can represent data and the relationships between different entities, but such entities will eventually need to be represented using raw data types such as strings and numbers. The Data Value Pattern assigns a data value to the attributes of entities. The Quality concept represents an attribute of an entity. The figure shows how concrete data values are connected to the higher-level, abstract Quality concept. Region is a concept that represents the data space the value belongs to. The data value can be encoded using XML Schema Datatypes. Finally, the hasPart relation in Quality allows us to model structured data values (vectors, matrices, etc.).
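As a rough picture of the Data Value Pattern, the sketch below connects a Quality to a Region holding a typed value and uses a part-of list for structured values. It is an assumption-laden illustration rather than the M3O ontology: the Python classes and the `values` helper are hypothetical, and the choice of `xsd:unsignedByte` for color channels is made here only for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Region:
    """Data space of a value, identified here by an XML Schema datatype name."""
    datatype: str
    value: object


@dataclass
class Quality:
    """Attribute of an entity; structured values are modelled through has_part."""
    name: str
    region: Optional[Region] = None
    has_part: list = field(default_factory=list)

    def values(self):
        """Collect the concrete data values of this quality and of all its parts."""
        collected = [self.region.value] if self.region is not None else []
        for part in self.has_part:
            collected.extend(part.values())
        return collected


# A color modelled as a structured value: a vector of three channel qualities.
color = Quality(
    "dominant-color",
    has_part=[
        Quality("red", Region("xsd:unsignedByte", 255)),
        Quality("green", Region("xsd:unsignedByte", 255)),
        Quality("blue", Region("xsd:unsignedByte", 0)),
    ],
)
channels = color.values()
```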
