CHAPTER 1 INTRODUCTION

Most of today's Web content is intended for use by humans rather than machines. When documents on the Web are searched by computer, human interpretation is still required before any useful information can be extracted. Computers can present information to the user, but they cannot determine which information is most relevant to the user in a given circumstance. The Semantic Web, on the other hand, is about having data as well as documents on the Web, so that machines can process, transform, assemble and even act on the data in useful ways. For instance, with Semantic Web technologies it becomes possible to automate operations ranging from arranging everything a user needs for a trip to updating his or her personal records. The Semantic Web can therefore be defined as a web of information on the Internet and intranets that carries annotations enabling access to precisely the information the user needs (Berners-Lee et al 2001). As a result, data can be used and shared effectively across applications.

1.1 SEMANTIC WEB

The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in co-operation (Berners-Lee et al 2001, Shadbolt et al 2006). In the Semantic Web vision, ontologies are used to semantically annotate the information currently on the Web. There are a variety of semantic representation languages, including RDF Schema and OWL. Given these ontologies and the semantic annotations of data made with respect to them, called semantic metadata, machines will be able to interpret the data on the Web efficiently and in a more automated manner. Hence machines, or agents, will be able to understand and act upon information about both the entities and the relationships contained on the Web.

Figure 1.1 Semantic web stack

The development of the Semantic Web proceeds in steps, each step building a layer on top of another. Figure 1.1 illustrates the hierarchy of Semantic Web layers (Berners-Lee et al 2001). The first layer of the Semantic Web layer cake comprises Unicode and URI. These are the foundations of the stack and are used to identify resources with unique identifiers. The second layer is XML and XML Schema, which are syntax languages for representing structured information. The third layer is RDF, which is more expressive than XML and serves as the data model for the Semantic Web. The next layer is RDF Schema (RDFS), a vocabulary language for RDF. The layer above it contains the OWL ontology language and the RIF rule language, alongside SPARQL, a query language and protocol for the Semantic Web. On top of the representation layers is the Unifying Logic layer, which is used to reason over RDF statements. Next comes the Proof layer, which is used to validate the RDF model, followed by the Trust layer, which supports the security of the Semantic Web. Finally, the User Interface and Applications layer sits on top of the stack. An overview of the Semantic Web technologies is presented in the following sub-sections.

1.1.1 Unicode and Uniform Resource Identifier (URI)

The first layer, URI and Unicode, carries over important features of the existing WWW. Unicode is a standard for representing the characters of all languages, whereas a URI is a string of characters used to uniquely identify a resource on the Internet. Together they form the foundation on which the Semantic Web identifies resources and serialises them in a concrete syntax.

1.1.2 Extensible Markup Language (XML)

The Extensible Markup Language is designed to transport and store data. XML is the syntactic basis for RDF and the Semantic Web (Bray et al 2000). It is a simple, flexible text format for structuring data, allowing for interoperability and data sharing. Its data model is an ordered, labelled tree. XML makes use of tags and attributes, which have no pre-defined meaning.

1.1.3 Resource Description Framework (RDF)

The Semantic Web is based on resources described in the open data model RDF, a framework for the representation of Web data and metadata (Carroll and Klyne 2004). RDF facilitates the automated processing and integration of data from diverse sources. Its data model is a directed, labelled graph. An RDF graph consists of a set of RDF statements, i.e. triples. A triple consists of a subject, a predicate and an object. The subject and object are nodes in the graph, and the predicate is an edge directed from the subject to the object, indicating a relationship between them. The three components can take the following forms:

Subject - RDF URI reference or blank node
Predicate - RDF URI reference
Object - RDF URI reference, literal or blank node

An RDF URI reference identifies an RDF resource. A blank node is a node without an RDF URI reference. Literals are strings that can be plain or have a data type. RDF has a number of serialisation formats, such as RDF/XML and Turtle (Beckett 2004).

1.1.4 RDF Schema (RDFS)

RDF Schema (Brickley and Guha 2004), abbreviated RDFS, is one of the languages used to represent information on the Semantic Web. RDFS is based on RDF and supports the formal definition of domain-specific vocabularies that describe the resources occurring in RDF data models. The basic concepts of RDFS are classes and properties, together with hierarchies among them. For properties, domain and range restrictions can be specified. This is sufficiently expressive to formalise taxonomies, i.e. the classification hierarchies that form the skeletons of ontologies. FOAF is an example of a well-known RDF vocabulary (Brickley and Miller 2005).
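The triple model of Section 1.1.3 and the class hierarchies of Section 1.1.4 can be illustrated with a small sketch. It is only a toy model written for this overview: triples are plain Python tuples, the `ex:` identifiers are invented, and the helper names `superclasses` and `types_of` are hypothetical rather than part of any RDF library.

```python
# Toy RDF graph: each statement is a (subject, predicate, object) tuple.
# All "ex:" names are made up for illustration.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    ("ex:Student", SUBCLASS_OF, "ex:Person"),
    ("ex:Person", SUBCLASS_OF, "ex:Agent"),
    ("ex:alice", RDF_TYPE, "ex:Student"),
    ("ex:alice", "ex:name", '"Alice"'),  # object is a plain literal
}


def superclasses(cls, graph):
    """Return every class reachable from cls by following rdfs:subClassOf."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def types_of(resource, graph):
    """Asserted types of a resource plus those implied by the class hierarchy."""
    asserted = {o for s, p, o in graph if s == resource and p == RDF_TYPE}
    inferred = set()
    for cls in asserted:
        inferred |= superclasses(cls, graph)
    return asserted | inferred


print(sorted(types_of("ex:alice", triples)))
```

Only `ex:Student` is asserted for `ex:alice`; the other two classes follow from the taxonomy, which is the kind of simple inference a class hierarchy makes possible.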

1.1.5 Web Ontology Language (OWL)

A more expressive representation language is OWL (Patel-Schneider et al 2004), which is used to explicitly represent and formally describe ontologies, i.e. the meaning of terms in vocabularies and the relationships between them. OWL provides three sublanguages of increasing expressiveness: OWL Lite, OWL DL and OWL Full. OWL Lite is the least complex sublanguage. It provides mechanisms for defining simple property cardinalities and is designed for modelling simple taxonomies and allowing basic inference. OWL DL provides the maximum expressiveness that still guarantees that all conclusions are decidable and computable in finite time. It therefore allows the same language constructs as OWL Full, but only under certain restrictions. DL stands for description logic (Baader et al 2003), a decidable fragment of first-order logic. OWL Full allows the full use of RDFS without restrictions. Expressions in OWL Full may lead to contradictions, and conclusions may not be computable.

1.1.6 Rules Layer: Rule Interchange Format (RIF)

The aim of the rules layer is to provide appropriate languages for representing rules on the Semantic Web; it currently sits alongside the ontology layer. The RIF is a W3C working draft that aims to develop an interchange format for different rule languages and inference engines, so that machines can share rules on the Semantic Web.

1.1.7 SPARQL

SPARQL (Prud'hommeaux and Seaborne 2006) is a query language used to retrieve and manipulate data stored in Resource Description Framework format. It provides capabilities for querying graph patterns along with their conjunctions and disjunctions, and supports value testing and constraining queries by source RDF graph. The results of SPARQL queries can be sets or graphs. A central concept of SPARQL is the basic graph pattern, a set of triple patterns that matches a subgraph of an RDF dataset when its variables can be substituted by RDF terms from that subgraph. The result of such a query pattern is the set of all distinct variable bindings that are possible for the queried RDF dataset.

1.1.8 Logic Layer and Inference

The logic layer is a reasoning system provided on top of the ontology structure to make new inferences; it should also be able to verify the trustworthiness or authenticity of a document. Reasoners are software tools for inferring conclusions from asserted facts. Most semantic reasoners use first-order predicate logic for inferencing; reasoning is based on inference rules, which are generally specified according to the ontology language. Jena, Pellet, KAON2 and FaCT are examples of semantic reasoners.

1.1.9 Proof Layer

The aim of the proof layer is to validate information generated as RDF, such as provenance knowledge or the form of reasoning that was used.

1.1.10 Trust Layer

The main point of the Web is that anyone can say anything about anything. Therefore, when we select a resource on the Web, we are putting our trust in it. We make trust judgments based on a source's perceived reputation, previous personal experience and so on. The same is true for the Semantic Web. Encryption mechanisms should allow people to sign up to trusted metadata on the Semantic Web. In addition, semantic agents need to make judgments when alternative sources of information are available. The aim of the trust layer is to address these problems.

1.1.11 User Interface and Applications

Semantic Web technologies are basically machine-oriented: formal models are used to express data so that machines can reason over them. However, Semantic Web applications are not only machine-oriented; they also support users. This layer of the Semantic Web stack is for user-oriented applications that improve the user's experience of the Semantic Web. Examples of user-oriented, Semantic Web-enabled interfaces are MSpace (Schraefel et al 2005), COHSE (Carr et al 2001) and Magpie (Dzbor et al 2003).

1.2 METADATA EXTRACTION TECHNIQUES

The process of extracting additional information from a Web resource with respect to an ontology is called metadata extraction, and it is required for our experiments. Both semi-automatic (Handschuh et al 2002) and automatic (Hammond et al 2002) techniques and tools have been developed. Existing tools include CREAM (Handschuh et al 2003), S-CREAM (Handschuh et al 2002), the Semagix Freedom toolkit (Hammond et al 2002) and SemTag (Dill et al 2003). Semagix Freedom has typically been used to populate ontologies that average more than one million instances (Sheth et al 2003) and can process over a million documents per day per server. SemTag, which is part of IBM's WebFountain project, has used a smaller ontology but has demonstrated Web-scale metadata extraction from well over a billion pages. Other tools used to extract metadata are OpenCalais, PyRdfa, GRDDL and RDF Distiller.
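Returning briefly to the basic graph pattern described in Section 1.1.7, the idea of substituting variables by RDF terms can be sketched in a few lines. The sketch below is not a SPARQL engine and uses no SPARQL library; the function names `match_pattern` and `match_bgp`, the convention that variables start with `?`, and the `ex:` data are all assumptions made for illustration.

```python
def is_var(term):
    """In this sketch, a term is a variable if it starts with '?'."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable already bound must keep the same value.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended


def match_bgp(patterns, graph):
    """Return all bindings under which every triple pattern matches the graph."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in graph:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Hypothetical data set and query: who works for an organisation in London?
graph = [
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
    ("ex:carol", "ex:worksFor", "ex:globex"),
    ("ex:acme", "ex:locatedIn", "ex:london"),
]
query = [
    ("?person", "ex:worksFor", "?org"),
    ("?org", "ex:locatedIn", "ex:london"),
]
for row in match_bgp(query, graph):
    print(row["?person"], row["?org"])
```

Each result is one variable binding; the shared variable `?org` is what joins the two triple patterns into a single graph pattern.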

1.3 RDF DATABASES AND STORAGE SYSTEMS

Numerous RDF storage systems currently exist, some of which are briefly discussed below.

1.3.1 Jena

Jena is an open source Semantic Web framework for Java. It provides an API to extract data from and write to RDF graphs. It was developed by the HP research labs in Bristol (McBride 2001). The Jena API provides both statement-centric and resource-centric methods for manipulating an RDF/OWL model. Additionally, the API provides built-in support for RDF containers (bag, alt and seq) and typed literals. Jena also provides integrated parsers and writers for RDF in various formats, which makes it easy to import and export serialized RDF/OWL (McBride 2001).

Jena includes a persistence subsystem that stores models in a back-end database. The default Jena database layout uses a de-normalized schema in which literals and resource URIs are stored directly in statement tables. The persistence subsystem also supports RDQL queries, which are dynamically transformed into SQL queries. Jena is currently compatible with MySQL, Oracle and PostgreSQL.

Jena also provides a reasoner subsystem that includes a rule-based inference engine together with configured rule sets for RDFS and, essentially, the OWL-Lite subset of OWL Full. The reasoner subsystem is extensible, in that a variety of external reasoners can be used in Jena. Additionally, Jena provides an ontology API designed for programmers who work with ontology data based on RDF. Currently, OWL, DAML+OIL and RDFS are supported. Jena has been used in the proposed method for accessing the RDF data storage.
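The idea of a de-normalized statement table, with URIs and literals stored inline and graph queries translated into SQL, can be sketched with Python's built-in `sqlite3` module. This is not Jena's actual schema or API: the table name `statements`, its column names, the helper functions and the `ex:` data are all assumptions chosen for illustration.

```python
import sqlite3

# Hypothetical single-table layout: one row per RDF statement, with URIs and
# literals stored directly in the row (illustrative only, not Jena's schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statements (subject TEXT, predicate TEXT, object TEXT)"
)
conn.executemany(
    "INSERT INTO statements VALUES (?, ?, ?)",
    [
        ("ex:alice", "ex:knows", "ex:bob"),
        ("ex:bob", "ex:worksFor", "ex:acme"),
        ("ex:alice", "ex:name", '"Alice"'),
    ],
)


def objects_of(conn, subject, predicate):
    """Triple pattern (subject, predicate, ?o) expressed as a SQL lookup."""
    rows = conn.execute(
        "SELECT object FROM statements "
        "WHERE subject = ? AND predicate = ? ORDER BY object",
        (subject, predicate),
    )
    return [row[0] for row in rows]


def employers_of_acquaintances(conn, person):
    """Two joined patterns: (person ex:knows ?y) and (?y ex:worksFor ?z)."""
    rows = conn.execute(
        "SELECT s2.object FROM statements AS s1 "
        "JOIN statements AS s2 ON s1.object = s2.subject "
        "WHERE s1.subject = ? AND s1.predicate = 'ex:knows' "
        "AND s2.predicate = 'ex:worksFor' ORDER BY s2.object",
        (person,),
    )
    return [row[0] for row in rows]


print(objects_of(conn, "ex:alice", "ex:knows"))
print(employers_of_acquaintances(conn, "ex:alice"))
```

Every additional triple pattern in a query becomes another self-join on the statement table, which is one reason path queries over relational RDF stores can become expensive.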

1.3.2 RDFSuite

RDFSuite (Alexaki et al 2001), developed by ICS-FORTH, is a set of highly scalable tools for managing large volumes of RDF description bases and schemas. Currently, RDFSuite includes a Validating RDF Parser (VRP), an RDF Schema Specific DataBase (RSSDB) and support for RQL.

VRP provides support for analyzing, validating and processing RDF schemas and resource descriptions. The parser syntactically analyzes the statements of a given RDF file according to the RDF specification. The validator checks whether the statements contained in both RDF schemas and resource descriptions satisfy the semantic constraints derived from the RDF Schema Specification.

RSSDB is a persistent RDF data store for loading resource descriptions into an object-relational DBMS by exploiting the available RDF schema knowledge. The main goal of the RSSDB schema-specific representation is to separate the RDF schema from the data, and to distinguish between the unary and binary relations that hold the instances of classes and properties. RSSDB comprises a Loading module and an Update module, both implemented in Java using a number of primitive methods (APIs) for inserting, deleting and modifying RDF triples. Lastly, RDFSuite supports the RQL query language.

1.4 SEMANTIC ASSOCIATIONS

A pair of entities can be connected by multiple complex relationships within semantic metadata. As RDF captures the meaning of entities by relating them to other entities, it is a natural fit for storing and acquiring such connectivity. Such relationships are a fundamental aspect of the Semantic Web and are commonly referred to as semantic associations. Querying and retrieving semantic associations is an important task in many applications, such as detecting money laundering, curriculum sequencing in e-learning and mining hidden relationships in biomedical data. Semantic associations deal with complex relationships in a knowledge base represented as an RDF graph and are based on concepts such as semantic connectivity and semantic similarity. The different types of semantic association in an RDF graph can be formally defined as follows (Anyanwu and Sheth 2003):

Definition 1 (Semantic Connectivity): Two entities $e_1$ and $e_n$ are semantically connected if there exists a sequence $e_1, p_1, e_2, p_2, e_3, \ldots, e_{n-1}, p_{n-1}, e_n$ in an RDF graph where $e_i$, $1 \le i \le n$, are entities, $p_j$, $1 \le j < n$, are properties, and entities $e_i$ and $e_{i+1}$ are in relationship $p_i$. Such a sequence of entities and properties represents a semantic path. Figure 1.2 shows the semantic connectivity between entities e1 and e2.

Definition 2 (Semantic Similarity): Two entities $e_1$ and $f_1$ are semantically similar if there exist two semantic paths $e_1, p_1, e_2, p_2, e_3, \ldots, e_{n-1}, p_{n-1}, e_n$ and $f_1, q_1, f_2, q_2, f_3, \ldots, f_{n-1}, q_{n-1}, f_n$ semantically connecting $e_1$ with $e_n$ and $f_1$ with $f_n$, respectively, such that for every pair of properties $p_i$ and $q_i$, $1 \le i < n$, one of the following conditions holds: $p_i = q_i$, $p_i \subseteq q_i$ or $q_i \subseteq p_i$ (where $\subseteq$ means rdfs:subPropertyOf, which is essentially property/relationship inheritance). We say that the two paths originating at $e_1$ and $f_1$, respectively, are semantically similar.

Definition 3 (Semantic Association): Two entities $e_x$ and $e_y$ are semantically associated if $e_x$ and $e_y$ are either semantically connected or semantically similar.
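Definitions 1 and 2 can be made concrete with a short sketch. It rests on several assumptions that are not in the definitions themselves: a path is written as a flat list alternating entities and properties, each property is taken to point from $e_i$ to $e_{i+1}$, the sub-property hierarchy is given as a set of (sub, super) pairs, and the function names and `ex:` data are invented for illustration.

```python
def is_semantic_path(path, graph):
    """Check that [e1, p1, e2, ..., en] is a chain of triples in graph.

    Assumes each property p_i is directed from e_i to e_(i+1).
    """
    if len(path) < 3 or len(path) % 2 == 0:
        return False
    return all(
        (path[i], path[i + 1], path[i + 2]) in graph
        for i in range(0, len(path) - 2, 2)
    )


def is_subproperty(p, q, sub_property_of):
    """True if p is a direct or indirect sub-property of q."""
    seen, frontier = set(), [p]
    while frontier:
        current = frontier.pop()
        for sub, sup in sub_property_of:
            if sub == current and sup not in seen:
                if sup == q:
                    return True
                seen.add(sup)
                frontier.append(sup)
    return False


def paths_similar(path_a, path_b, sub_property_of):
    """Compare two paths property by property, as in Definition 2."""
    props_a, props_b = path_a[1::2], path_b[1::2]
    if len(props_a) != len(props_b):
        return False
    return all(
        p == q
        or is_subproperty(p, q, sub_property_of)
        or is_subproperty(q, p, sub_property_of)
        for p, q in zip(props_a, props_b)
    )


# Hypothetical data: ex:authorOf is declared a sub-property of ex:contributorTo.
graph = {
    ("ex:alice", "ex:authorOf", "ex:paper1"),
    ("ex:paper1", "ex:publishedIn", "ex:conf1"),
    ("ex:bob", "ex:contributorTo", "ex:report1"),
    ("ex:report1", "ex:publishedIn", "ex:series1"),
}
sub_property_of = {("ex:authorOf", "ex:contributorTo")}

path_a = ["ex:alice", "ex:authorOf", "ex:paper1", "ex:publishedIn", "ex:conf1"]
path_b = ["ex:bob", "ex:contributorTo", "ex:report1", "ex:publishedIn", "ex:series1"]

print(is_semantic_path(path_a, graph), is_semantic_path(path_b, graph))
print(paths_similar(path_a, path_b, sub_property_of))
```

Note that `paths_similar` only compares the property sequences; Definition 2 additionally requires both sequences to be semantic paths, which `is_semantic_path` checks separately.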

Figure 1.2 Semantic connectivity between entity e1 and e2

The study assumes that an RDF data store already exists and that the semantic association paths between entities are discovered within it. It should be noted that the proposed work concentrates on ranking techniques for semantic connectivity associations. Note that entity and instance are used interchangeably throughout this document, as are property and relation.

1.5 MOTIVATION OF THE PROBLEM

Accessing relevant information on the Web has become difficult because of the explosive growth of the information available. Many users try to find this information with search engines, but search-engine-based systems only locate documents that match given keywords or key phrases. Different entities on the Web can be related in multiple ways that cannot be pre-defined. In the Semantic Web, however, the RDF data model (Lassila and Swick 1999) captures the meaning of an entity or resource by specifying how it relates to other entities or classes of resources. At present, many applications, such as intelligence analysis, genetics, pharmaceutical research and flight security, require more complex relationships than simple direct relationships between entities.

Figure 1.3 Extracting semantic associations (figure labels: heterogeneous documents, databases, documents, ontology, semantic association)

A semantic association is a sequence of complex relationships between entities in the metadata extracted from heterogeneous documents. Metadata extraction is the process of extracting additional information from various resources with respect to an ontology. This is illustrated in Figure 1.3, which shows entities and relationships that originate from different sources, together with the semantic association paths between two entities. Searching the Semantic Web for semantic relationships among entities such as people, places and events will be an essential capability in the future. Users may formulate queries for semantic associations in applications such as the following:

1. Intelligence analysis - Finding relationships between immigrants, front businesses and currency transactions to locate money laundering operations
2. Genetics - Discovering complex relationships between proteins and genes
3. Pharmaceutical research - Finding counter effects between drugs
4. Flight security - Finding connections between a passenger and one or more other passengers on the same flight or on different flights

A search for semantic associations in RDF typically returns multiple paths connecting two entities. Each path has a different meaning depending on the types of relation it contains. Figure 1.4 shows a graph connecting multiple entities, and Figure 1.5 shows the semantic association paths between entity e1 and entity e2, illustrating that there may be numerous paths between two entities.

Figure 1.4 Graph connecting multiple entities (nodes: e1, e2, a, b, c, d, f, g, h, i, k, p)

e1 - h - a - g - f - k - e2
e1 - h - a - g - f - k - b - p - c - e2
e1 - h - a - g - b - p - c - e2
e1 - h - b - g - f - k - e2
e1 - h - b - k - e2
e1 - h - b - p - c - e2
e1 - i - d - p - c - e2
e1 - i - d - p - b - h - a - g - f - k - e2
e1 - i - d - p - b - g - f - k - e2
e1 - i - d - p - b - k - e2

Figure 1.5 Semantic association between two entities e1 and e2
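To show how quickly the number of paths between two entities grows, the sketch below enumerates every path that visits no entity twice. It is a plain depth-first enumeration over a small hypothetical undirected graph, not the graph of Figure 1.4 and not the path-finding method developed later in this thesis; the function names and the edge list are assumptions made for illustration.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    return adjacency


def simple_paths(adjacency, start, goal):
    """Yield every path from start to goal that visits no entity twice."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            yield path
            continue
        for neighbour in adjacency.get(node, ()):
            if neighbour not in path:
                stack.append((neighbour, path + [neighbour]))


# Hypothetical graph with two intermediate entities a and b.
edges = [("e1", "a"), ("e1", "b"), ("a", "b"), ("a", "e2"), ("b", "e2")]
adjacency = build_adjacency(edges)

for path in sorted(simple_paths(adjacency, "e1", "e2")):
    print(" - ".join(path))
```

Even with only two intermediate entities and five edges there are four distinct paths, which is why the results need to be ranked or filtered before they are shown to a user.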

Of the semantic associations found between two entities, some may be relevant to a user while others are irrelevant, depending on the user's perspective. Hence, a suitable methodology is required to improve the precision of the ranked results by filtering out the irrelevant paths, or ranking them lower, according to the user's domain of interest.

1.5.1 Objectives

Based on the motivation outlined above, the objectives of this research are defined as follows:

1. Apply a personalization mechanism to identify the user's interest level in various domains, in order to improve precision during the ranking process.
2. Evolve a methodology to find the semantic association paths between entities that pass only through user-specific intermediate entities.
3. Adapt an appropriate method to find the contexts that are closest to the user's specification.

1.5.2 Limitations

The following are the limitations of the research work:

1. In the proposed methods, it is assumed that the RDF data storage already exists.
2. Calculating trust weight is out of the scope of the current work.
3. The proposed method considers only semantic connectivity associations.

1.6 ORGANIZATION OF THESIS

This thesis is divided into seven chapters. Chapter one is the introduction. Chapter two discusses the ranking metrics that influence the relevance of ranking results. Chapter three gives the literature review. Chapter four presents the personalization approach to ranking semantic associations; the experimental results of this approach are presented and discussed in the same chapter. Chapter five describes the approach for improving precision in semantic association paths that pass through a user-specific intermediate node. It includes the motivation for the approach, the bi-directional BFS algorithm and a discussion of the experimental results. Chapter six presents the context closeness approach to finding relevant semantic associations. It describes the motivation of the problem, the methodology for applying context closeness during the ranking process and the experimental results. Finally, Chapter seven presents the conclusions and the scope for future work.