Next generation knowledge access

John Davies, Alistair Duke, Nick Kings, Dunja Mladenić, Kalina Bontcheva, Miha Grčar, Richard Benjamins, Jesus Contreras, Mercedes Blazquez Civico and Tim Glover

Abstract

Purpose: The paper shows how access to knowledge can be enhanced by using a set of innovative approaches and technologies based on the semantic web.

Design/methodology/approach: Emerging trends in knowledge access are considered, followed by a description of how ontologies and semantics can contribute. A set of tools based on semantic web technology is then presented. For each of these tools a detailed description of the approach is given, together with an analysis of related and future work as appropriate.

Findings: The tools presented are at the prototype stage but can already show how knowledge access can be improved by allowing users to express more precisely what they are looking for and by presenting results to them in a form that is appropriate to their current context.

Research limitations/implications: The tools show promising results in improving access to knowledge, which will be further evaluated within a practical setting. The tools will be integrated and trialled as part of case studies within the SEKT project. This will allow their usability and practical applicability to be measured.

Practical implications: Ontologies as a form of knowledge representation are increasing in importance. Knowledge management, and in particular knowledge access, will benefit from their widespread acceptance. The use of open standards and compatible tools in this area will be important to support interoperability and widespread access to disparate knowledge repositories.

Originality/value: The paper presents research in an emerging but increasingly important field, i.e. semantic web-based knowledge technology.
It describes how this technology can satisfy the demand for improved knowledge access, including providing knowledge delivery to users at the right time and in the correct form.

Keywords: Knowledge management, Worldwide web, Semantics, Search engines
Paper type: Research paper

(Information about the authors can be found at the end of the article.) This work was supported by the IST Programme of the European Community under SEKT, Semantically Enabled Knowledge Technologies (IST IP) and PASCAL Network of Excellence (IST ); and partially by the Slovenian Research Agency. This publication only reflects the authors' views.

JOURNAL OF KNOWLEDGE MANAGEMENT, VOL. 9 NO , pp , © Emerald Group Publishing Limited, ISSN DOI /

1. Introduction

Today, we can observe a number of emerging trends in technologies for intelligent knowledge access, including developments in search engines, categorisation tools and visualisation systems. This paper gives a brief overview of them, describes ongoing efforts to develop semantic web-based knowledge access tools, and discusses how a semantic web-based approach can provide a coherent framework to address many of these emerging requirements.

1.1 Trends in knowledge access

A number of trends can be discerned in the knowledge access marketplace as vendors and users alike start to think "beyond Google". It is instructive to review these briefly, since the work described in the remainder of the paper addresses all of these issues via a single coherent technology framework: the semantic web.

Desktop search: Google are moving to support searching the desktop. This is because Microsoft is moving into search in a big way and currently has the advantage of being well versed in processing desktop formats. So Microsoft is moving into Google's space, and vice versa.

Categorisation: as ranking quality increases, it is likely that the relative differences between search algorithms from different vendors will get smaller. Therefore a new differentiator has to be found, and one possibility is organising the results of a search for the user by category (e.g. Verity, clusty.com).

Integrated search: future searches may not be initiated by visiting a webpage separate from your application but rather by, for example, highlighting a chunk of text in a Word document and right-clicking. This is an area Microsoft would hope to dominate by embedding its search capability into Office applications. The advantage is a reduced user overhead for initiating searches.

Seamless search: this involves firing off implicit queries based on user activity (see blinx.com). This means less overhead is required to access information (you don't have to stop what you're doing). Ideally, this will combine a search of the desktop and the web (and other areas to which the user has access, e.g. networked drives, public folders, etc.).

Personalised search: tweaking the search based on a user's prior searches or a personal profile of some kind. Microsoft have "Stuff I've Seen" (which could be used to derive a profile) and active folder research, which adds to a folder items, such as files and links, that are found (on the desktop or the web) to be relevant to the content of that folder (i.e. expanding it virtually).

Beyond search: some vendors are aiming to add intelligent, sub-document analysis of results (e.g. Corpora's Jump!). The aim is not just to give the user a long list of documents but also to help with the next step, the analysis of the returned information. This allows the user to browse digests of the information based on different topics/categories found in the results list.
A9.com (from Amazon) provides a meta-search over selectable sources where the results are split into categories such as books, movies, images, references, etc. The user's bookmarks can also be searched, forming another category.

Visualisation: visualisation of search results generally means using 2D or 3D representations of the search results and/or the topics that they have been classified against. This can be useful because it allows the user to quickly grasp the results and the categories. The most important topics might be represented by larger icons, drawing the user to them first. Examples of visualisation-based search tools include webbrain.com and kartoo.com.

Device independence: knowledge workers use an increasingly sophisticated and diverse range of devices and expect to be able to access information wherever and whenever they are. As well as PCs (desktops and laptops), mobile phones (including SMS messaging, WAP browsing and use of 3G multimedia capability) and various PDA device types are now commonly used.

1.2 Role of ontologies and semantics

All of the trends identified above can be further enabled or enhanced by the application of semantic technology. As discussed in more detail by Davies and Sure (2005), in this issue, the semantic web (Berners-Lee et al., 2001) provides enhanced information access based on the exploitation of machine-processable metadata. Central to the vision of the semantic web are ontologies. These are seen as facilitating knowledge sharing and re-use between agents, be they human or artificial (Fensel, 2001). They offer this capability by providing a consensual and formal conceptualisation of a given domain. As such, the use of ontologies and supporting tools offers an opportunity to significantly improve knowledge management capabilities on the intranets of organisations and on the wider web.
It is generally accepted that search engines based on conventional IR techniques alone (employing keyword and phrase matching between the query and index) tend to offer high recall and low precision. The user is faced with too many results, many of which are irrelevant. The main reason for this is the failure to handle polysemy (a word that has two or more similar meanings) and synonymy (a word that has the same meaning as another word). The use of ontologies and associated metadata can allow users to express their queries more precisely, thus avoiding the problems identified above. Users can choose ontological concepts to define their query, or select from a set of concepts returned by a search in order to refine their query. This can improve the accuracy of a search. Searching can also be extended, as we will see, by the use of a user profile or the context of a search, making searching personalised and aiding integrated and seamless searching.

Furthermore, the use of semantic technology offers the prospect of a more fundamental change to knowledge access. Current technology supports a process wherein the user attempts to frame an information need by specifying a query in the form of either a set of keywords or a piece of natural language text. It is interesting to note that, despite the benefits claimed by some vendors for natural language querying[1], the average search engine query length on the web in a recent survey was just 2.2 words[2]. Having submitted a query, the user is then presented with a ranked list of documents of relevance to the query. The techniques for ranking documents have been the subject of more than 30 years of research and are well understood, publicly available and differ relatively little in terms of performance, notwithstanding the claims of some search engine vendors. In short, today's search engines are much of a muchness. It is suggested here, therefore, that the future of search engines lies in supporting more of the information management process, as opposed to seeking incremental and modest improvements to relevance ranking of documents.
In this approach, software supports the process of actually reading and analysing relevant documents, rather than merely listing them and leaving the rest of the information analysis task to the user. Corporate knowledge workers need information defined by its meaning, not by text strings ("bags of words"). They also need information relevant to their interests and to their current context. They need to find not just documents, but sections and information entities within documents and even digests of information created from multiple documents. As described below, the exploitation of metadata and ontological information can offer this information-centric approach, as opposed to the prevailing document-centric technology.

The generation of ontologies and the creation of metadata attributing information to them is obviously key to the success of these advanced knowledge access approaches (see Cunningham and Bontcheva, 2005, in this issue). Techniques allowing (semi-)automatic creation of these are under development. This paper considers automated profile construction, while the related wider problem of knowledge discovery is considered by Grobelnik and Mladenić (2005) in this issue.

1.3 Overview of rest of the paper

The remainder of this paper describes a number of ongoing efforts to develop semantic web-based knowledge access tools, mainly coming from the SEKT project[3]. Our vision is to develop and exploit the knowledge technologies which underlie next generation knowledge management. We envision knowledge workplaces where the boundaries between document management, content management, and knowledge management are broken down, and where knowledge management is an effortless part of day-to-day activities. Appropriate knowledge is automatically delivered to the right people at the right time at the right granularity via a range of user devices.

A number of systems have been developed to help realise this vision. The first of these, described in Section 2, considers how information delivery can be personalised based upon automatically constructed user profiles. The constructed profile is used to enable browsing of the user history in an interest-focused way. The user is able to see which parts of an ontology are related to their current browsing focus and also which recently viewed pages are relevant to a selected concept from the ontology.

The search and browse system in Section 3 provides search agents that use ontology-based queries incorporating named entities (e.g. search for a person named Nick Kings in an organisation named BT). The agents periodically crawl the web and update the users with new results found.

Section 4 is concerned with the visualisation of ontologies with a view to supporting browsing. The approach is unique in that it considers visualisation from the point of view of the user rather than focusing on the technical aspects.

The knowledge generation section (Section 5) describes an approach to generating natural language from ontological data. This can provide automated documentation of ontologies and knowledge bases. Unlike human-written texts, the automatic approach keeps the text constantly up to date, which is vitally important in the semantic web context, where knowledge is dynamic and is updated frequently. The natural language generation approach also allows generation in multiple languages without the need for human or automatic translation.

Finally, the paper considers device independence in Section 6, the aim of which is to provide an effective user interface to a web application for devices with widely varying capabilities, without having to write a separate site for each class of device.

2. Personalisation

With more than 8 billion documents on the web and billions more in corporate and government intranets, personalised information delivery based on user and document profiling is an important step in providing relevant and timely knowledge to the right people, a key concern of knowledge management. User profiles, which aim to model the user's information requirements, are central to personalised information delivery.

2.1 Profile construction

A user profile, which is used as a basis for personalisation, can be constructed manually, semi-automatically or fully automatically. Manual approaches to constructing user profiles usually rely on the user or a domain expert. The profile is provided by a human in the form of rules, filters, scripts, etc. (e.g. filters for sorting incoming e-mails into the user's folders). Automatic and semi-automatic approaches, on the other hand, rely on a system that is usually capable of capturing user characteristics by observing the user's behaviour, in some cases requiring feedback or guidance from the user. In this paper we address automatic approaches to user profiling.

User profiles can be automatically constructed from different data sources using a variety of techniques, including content-based user profiling and collaborative user profiling. Content-based user profiling is usually applied to problems involving text documents (i.e. the user is accessing and reading text documents), where content analysis of the document text is performed in order to construct a profile. For instance, content analysis is used to help the user in web browsing by highlighting hyperlinks to documents similar to those already requested (Mladenić, 2002). Collaborative user profiling is based on the assumption that similar users have similar preferences.
In other words, by finding users that are similar to the active user and by examining their preferences, the recommender system can predict the active user's preferences for certain items and provide a ranked list of items which the active user will most probably like. Collaborative user profiling generally ignores the form and the content of the items and can therefore also be applied to non-textual items. Furthermore, it can detect relationships between items that have no content similarities but are linked implicitly through the groups of users accessing them. These groups (communities) are formed around a specific user profile.
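The collaborative idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `cosine` and `recommend`, the rating values and the choice of cosine similarity are hypothetical and are not taken from any of the systems cited in this paper.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(active, ratings, k=2):
    """Rank items the active user has not seen, using the k most similar users."""
    others = [(cosine(ratings[active], r), u) for u, r in ratings.items() if u != active]
    neighbours = sorted(others, reverse=True)[:k]
    scores = {}
    for sim, user in neighbours:
        for item, value in ratings[user].items():
            if item not in ratings[active]:
                # Weight each neighbour's preference by its similarity to the active user.
                scores[item] = scores.get(item, 0.0) + sim * value
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical preference data: item contents are never inspected.
ratings = {
    "ann": {"doc1": 5, "doc2": 4},
    "bob": {"doc1": 5, "doc2": 4, "doc3": 5},
    "cat": {"doc4": 5},
}
ranked = recommend("ann", ratings)
```

Because only the user-item preferences are used, the same sketch would apply unchanged to non-textual items.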

User profiles can be represented in different ways, ranging from simple filters to statistical models (either numerical or symbolic) of considerable complexity. The profile representation in the case of automatic profile construction mainly depends on the technique used for the profile construction. In this paper we concentrate on profiles represented by ontologies.

2.2 Representing profiles in ontologies

Several researchers have developed approaches to user profiling that represent profiles in some kind of ontology. A topic ontology in the form of a tree-like hierarchy of the user's interests was proposed by Kim and Chan (2003), with the root being the user's general interest (i.e. long-term interest) and the leaves representing domains the user is or was ever interested in (i.e. short-term interests). User interest hierarchies are built using a form of hierarchical clustering on a set of web pages visited by a user. A similar approach was used by Grčar et al. (2005) for enhancing usage of the user browsing history, as described later in this section.

Another way of constructing a user profile is to analyse the user's browsing history and apply modified collaborative filtering techniques (Sugiyama et al., 2004). Here, the user profile is also a combination of both the user's persistent preferences (long-term preferences) and the user's ephemeral preferences (short-term preferences, or "today's preferences") and is represented as a vector of term weights. Modified collaborative filtering is then applied to a user-term matrix (in contrast to the user-item matrix of the original collaborative filtering approach, hence the word "modified") to predict the missing term weights in each user profile. Clustering is used (in one of their approaches) to determine user communities. Cluster centroids are compared to the active user's term vector to find the user's neighbourhood (a threshold is used to discard less relevant communities).
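The neighbourhood step can be illustrated as follows. This is a minimal sketch, not the method of Sugiyama et al. (2004): the cosine measure, the threshold value of 0.5, the function names and the term weights are all assumptions made for the example.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two term-weight vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def neighbourhood(user_vector, centroids, threshold=0.5):
    """Keep the communities whose centroid is similar enough to the user."""
    return [name for name, centroid in centroids.items()
            if cosine(user_vector, centroid) >= threshold]

# Hypothetical community centroids and an active user's term vector.
centroids = {
    "semantic_web": {"ontology": 0.9, "rdf": 0.8},
    "cooking": {"recipe": 1.0, "oven": 0.7},
}
user = {"ontology": 0.7, "search": 0.3}
selected = neighbourhood(user, centroids)
```

Missing term weights for the active user could then be estimated from the centroids of the selected communities only.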
The clustering-based approach, according to Sugiyama et al. (2004), achieves the best results.

In the Foxtrot recommender system (Middleton et al., 2003), an ontology based on the CORA digital library is used; new documents are classified into the taxonomy by using a variant of the nearest neighbour algorithm (Mitchell, 1997). A user profile holds a set of topics and their corresponding interest values. Each topic adds 50 per cent of its interest value to its super-class. They also used static knowledge ontologies to alleviate the cold-start problem. The visualisation of profiles is used to encourage immediate user feedback. For evaluation, collaborative filtering is performed on a user-topic matrix (they term this technique "collaborative and content-based recommendations").

Recently, Grčar et al. (2005) have proposed user profiling for an interest-focused browsing history. The system provides a dynamic user profile in the form of a topic ontology. After a page is viewed by the user, the textual content is extracted and stored as a text file. A collection of such text files is maintained in two folders. The first folder holds a relatively small number (e.g. five) of the most recently viewed pages (the short-term interest folder). The second folder contains a larger number (e.g. 300) of the last viewed pages (the long-term interest folder). When a page is first visited, it is placed into both folders. Eventually it gets pushed out by other pages that are viewed afterwards. A page stays in the long-term interest folder much longer than in the short-term interest folder (hence the terms long- and short-term), because a much higher number of new pages need to be viewed for the page to be pushed out of the long-term interest folder. The long-term interest pages are treated slightly differently from the short-term interest pages.
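The two-folder scheme behaves like a pair of bounded first-in, first-out queues. The sketch below models it with in-memory queues rather than folders of text files; the class name `BrowsingHistory` and the small sizes used in the example are assumptions for illustration only.

```python
from collections import deque

class BrowsingHistory:
    """Keeps the most recently viewed pages in a short- and a long-term store."""

    def __init__(self, short_size=5, long_size=300):
        # A deque with maxlen silently drops the oldest entry when full.
        self.short_term = deque(maxlen=short_size)
        self.long_term = deque(maxlen=long_size)

    def view(self, page_text):
        # A newly viewed page is placed into both stores.
        self.short_term.append(page_text)
        self.long_term.append(page_text)

# Tiny sizes so that the push-out behaviour is visible.
history = BrowsingHistory(short_size=2, long_size=4)
for n in range(1, 6):
    history.view(f"page{n}")
```

After five page views, the short-term store holds only the last two pages while the long-term store still holds the last four, which is why a page survives much longer in the long-term store.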
To construct the user profile, a variant of hierarchical clustering is performed on the long-term interest folder to obtain the user topic ontology. The root of the topic ontology holds the user's general interest while the leaves represent his/her specific interests. General interest stands for all the topics the user is or ever was interested in, while the term specific interest usually describes one more-or-less isolated topic that is or ever was of interest to the user.

The constructed profile is used to enable browsing of the user history in an interest-focused way as follows. The recently visited pages (representing the user's short-term interest) are mapped to the user topic ontology. The mapping reveals the extent to which an ontology node from the user profile (i.e. a set of pages) is related to the user's short-term interest. By highlighting nodes with an intensity proportional to the similarity score, we can clearly expose the topic ontology segments that are of current interest to the user. Due to the highlighting, the user can clearly see which parts of the topic ontology are relevant to his/her current interest. He/she can also access previously visited pages by selecting a node in the ontology, which is visualised in the application window. This can be explained as the user's interest-focused web browsing history, the interest being defined by the selected node.

Grčar et al. (2005) have developed a system using the described approach, where the user profile is visualised on an Internet Explorer toolbar. In addition to having a visual presentation of his/her long-term interests with the parts of current interest highlighted, the user can select a node in the user profile ontology to get a list of the specific keywords and the associated web pages.

To summarise, methods for automatic creation of user profiles and their representation in ontologies are becoming more mature and ready to be applied in a number of personalised applications, such as ontology-based search and browse, which is discussed next.

3. Search and browse

We believe that it is important to develop tools to exploit the dynamic profile information discussed in the previous section, in order to capture a user's information needs. In turn, the tools will relate his or her information needs to the wider community's ontology.
Having derived a profile, the issue is to intelligently present relevant information, and to route information between two or more members of the community. By classifying information against an ontology, our goal is to provide facilities that augment each community member's personal memory, and enhance recall of information at later points in time. Ontologies have the potential to underpin and enable efficient searching and large-scale knowledge sharing by:

the identification of communities of interest within wider communities. It is feasible to identify sets of people with common sets of interests, but it is crucial to understand how these groups form and can be maintained; and

using the underlying ontology to identify implicit user needs, and being able to fetch information in advance of a user's explicit query.

One of the proposed toolsets is a platform to support searching and browsing, using semantic agents. The problems of outdated indexing and poor search coverage on the WWW are well known (Lawrence and Giles, 1999). Searching for information is also problematic, as conventional search engines tend to have high recall and low precision[4]. This often results in the user being presented with far too many results in response to their query, many of which are not relevant to the user's information need. There are a number of reasons for this, the foremost being the failure of the search engines to cope with the fact that words may have two or more similar meanings and that several terms may be used to describe the same concept. By searching documents classified against a domain ontology, the search engine is able to disambiguate the terms of the query and locate information in a more precise manner.

Our search and browse prototype is currently based around a centralised server, running an instantiation of the KIM platform (Popov et al., 2004).
The KIM platform allows a user to have access to a series of documents that have been annotated against the KIMO ontology (Popov et al., 2004) (see Figure 1). KIM can be considered to be a number of application services to support the automatic semantic annotation, indexing, and retrieval of unstructured and semi-structured content. In essence, the system operates as a typical web crawler, with the added stage of automatically extracting metadata and annotating local copies of retrieved web pages, in a similar fashion to Armadillo (Ciravegna et al., 2004) and htechsight (Aldea et al., 2005).

Figure 1 Outline architecture

In addition to the semantic repository being constructed, each user also has a number of "agents": these agents regularly search the local semantic index to notify the users that new pieces of information have been found. The agents perform queries against named entities rather than the simple string matching used by, for example, Google.

As each document is stored within KIM, named entities are identified and extracted. New statements are added to the semantic repository as a result of the information extraction process. KIM is able to store explicit and implicit statements. Explicit statements are about both recognised entities and simple relations, such as a position within an organisation or an organisation located in a location. Additionally, implicit statements are inferred according to the inherent transitivity of some properties within KIMO (if smith is of type X, it is also of type Y if X is a subclass of Y). For example, a web page may contain text about Nick Kings, which is identified as a named entity of type "Man"; KIM also infers, through KIMO, that Nick Kings is also of type "Person". Furthermore, the KIMO ontology contains a number of custom axioms used to yield more implicit statements, for properties such as subregionof. For example, KIMO can store "Munich is in Germany" but also infer that Munich is in Europe because Germany is in, or more formally is a subregionof, Europe. KIM does not, itself, build direct associations between web pages, but will allow subsequent queries of the form "find all documents about Nick Kings, where Nick Kings is a Person".
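The two kinds of implicit statement described above can be illustrated with a small forward-chaining sketch. It is not KIM's implementation or API: the function `transitive_closure` and the in-memory sets of pairs are assumptions, and only the facts used as examples in the text are encoded.

```python
def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) are present, until nothing changes."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Explicit statements taken from the examples in the text.
subclass_of = {("Man", "Person")}
subregion_of = {("Munich", "Germany"), ("Germany", "Europe")}
instance_of = {("Nick Kings", "Man")}

# Implicit types: an instance of a class is also an instance of its super-classes.
all_subclasses = transitive_closure(subclass_of)
inferred_types = instance_of | {
    (entity, sup)
    for entity, cls in instance_of
    for sub, sup in all_subclasses
    if sub == cls
}

# Implicit locations: subregionof is treated as a transitive property.
inferred_regions = transitive_closure(subregion_of)
```

Computing these sets once, when statements are added, means a later query such as "is Nick Kings a Person?" becomes a simple membership test.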
Inference is carried out as pages are stored, rather than at the time when queries are carried out, in order to improve the performance for end users.

3.1 User agents

Each user is able to create a number of semantic queries, each termed here an agent, and these queries are carried out on a regular basis. Thus, each agent searches for documents that contain entities that match the user's long-term interests. The user is currently able to express searches to find documents that contain information about the following:

a named person holding a particular position, within a certain organisation;

a named organisation located at a particular location;

a particular person;

a named location; and

a named company, active in a particular industry sector.
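An agent of the simplest kind listed above (a particular person) might be sketched as a stored query that is re-run against the annotated index and reports only unseen documents. The classes `Entity`, `Document` and `Agent`, and the example.org URLs, are hypothetical and do not reflect the SEKTagent or KIM interfaces.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    name: str
    types: frozenset  # explicit and inferred types, e.g. {"Man", "Person"}

@dataclass
class Document:
    url: str
    entities: list

@dataclass
class Agent:
    entity_type: str
    name: str
    seen: set = field(default_factory=set)

    def run(self, index):
        """Return URLs of matching documents that were not reported before."""
        hits = [
            doc.url
            for doc in index
            if doc.url not in self.seen
            and any(e.name == self.name and self.entity_type in e.types
                    for e in doc.entities)
        ]
        self.seen.update(hits)
        return hits

nick = Entity("Nick Kings", frozenset({"Man", "Person"}))
index = [Document("http://example.org/a", [nick])]
agent = Agent(entity_type="Person", name="Nick Kings")
first = agent.run(index)    # the crawler has annotated one matching page
second = agent.run(index)   # nothing new since the last run
index.append(Document("http://example.org/b", [nick]))
third = agent.run(index)    # only the newly annotated page is reported
```

The match is made on typed entities rather than on text strings, so a page mentioning a place called "Kings" would not be returned.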

The documents matched by an agent must, of course, have been previously fetched and annotated by the web crawling part of the system.

3.2 Future developments

The focus of the development is to provide a user with regular content delivery, based around the carrying out of semantic queries, rather than the development of improved web crawling techniques. As this is the first-stage prototype, the agents are quite naive. It is envisaged that the next stages of development will increase the level of sophistication in the following ways.

More sophisticated feedback and learning for the agents. Agent systems, such as ProSearch, have already been used to incrementally build complex queries to represent a user's long-term interests (see Davies et al., 1998 and Davies, 2000 for more details). For the next stage of the SEKTagent, as a user reads documents located by the agent, usage feedback will be collected. From this usage information, complex semantic queries would be created to represent searches such as "find information about Biotech Ltd, in Germany, but not containing information about the CEO".

Ability to update and detect changes to content that has already been added to the semantic repository. Keeping usage information is central to another strand of development. The information contained on the web is not static, with page contents changing on a daily basis. As the agent's role is to represent a user's interests, the agent should be able to identify when, and if, a page has changed and whether the user should be notified about the page contents. Even if a user has already seen a document, the agent must be able to decide whether the page has changed sufficiently to notify the user again.

Incorporate background knowledge ontology and user profiles. We are developing an ontology, called PROTON[5], to represent the classes and relationships required to model a knowledge sharing community. Currently, the SEKTagent stores the queries within a simple database.
The enhanced SEKTagent will represent the user and the user's interests through the Profile class within PROTON. It is envisaged that by representing interests in this fashion, further inferences can be made about documents that a user will find useful. For example, a user would be able to state interests in topics, such as metallurgy, rather than having to express interests in terms of complex search strings.

4. Visualisation

In addition to the ontology-based knowledge access to unstructured content discussed above, new methods are needed to provide user-friendly visualisation of the ontology itself. This needs to be intuitive, personalisable, and abstracted away from the formal representation of the knowledge as concepts, properties, and axioms. A number of relevant approaches are presented next.

4.1 Existing approaches for visualisation

The survey of semantic web visualisation tools by Sevilla et al. (2004) lists current tools that allow the direct translation of ontologies into browsable formats. Many of those tools exactly reproduce the ontology structure in a visual formalism, without taking into account external constraints such as usability issues (many ontology concepts have not been designed for visualisation purposes), browsing issues (some concepts, especially those representing relations, introduce tedious browsing paths) or the possibility of applying user-defined rules for personalisation or even interaction (for instance, this kind of application cannot visualise the differences between instances of the same concept or detect separate instance groups). Some of the applications of this kind are: Spectacle developed by Aduna (Harmelen et al., 2001), Jambalaya[6], IsaViz[7], Ontorama (Eklund et al., 2002) and RDFSVisualizer[8].

The following are the requirements for the visualisation tool development that we have identified as important:

the existence of user profiles for the visualisation of the ontology. Existing approaches visualise the ontology knowledge as it is stored, without filtering it or transforming it for the different types of users (sometimes the ontology is modelled in ways that should not be presented to non-advanced users, and only the instances of a few classes or the aggregation of instances of several classes have to be presented);
the specificities of visualising knowledge and its hierarchical structure, the explicit relationships between them, etc.; and
easy configuration of how to present different types of instances.

4.2 Description of the approach

Our experience of developing several ontology-based applications showed us that the knowledge base as modelled by domain experts and knowledge engineers is not always a good candidate for visualisation as is. Since many relations in these knowledge bases were modelled as explicit concepts, navigation became tedious and unfriendly. The main purpose of building ontologies is to provide semantic content for intelligent systems. The knowledge models are designed to offer the appropriate information to be exploited by the software. No visualisation criteria are used to build an ontology, and often the information is not suitable to be published as it is, for example:
concepts may have too many attributes;
when relations are represented as independent concepts (first class objects) the navigation becomes tedious; and
concepts to be shown do not always correspond to modelled ones.

Therefore we felt a need for explicit visualisation rules that allow the creation of views on the domain ontology, in order to visualize only the relevant information in a user-friendly way and to filter it according to user profiles. We introduced the concept of a visualisation ontology, which makes explicit all visualisation rules and allows easy interface management.
This ontology will contain concepts and instances (publication entities) as seen on the interface by the end user, and it will retrieve the attribute values from the domain ontology using a query. It does not duplicate the content of the original ontology, but links the content to publication entities using an ontology query language such as RDQL[9] or SeRQL[10]. This way, one ontology that represents a particular domain can be visualized through different views, each containing the necessary information about how to interact with those contents. Figure 2 shows how this approach decouples the publication of the ontology in any kind of visualisation model (be it a set of HTML pages or a 3D model) from the knowledge contained in the domain ontology.

The visualisation ontology has the following predefined concepts:
PublicationEntity: the concept that encapsulates objects as they will be published in the final application. Any concept defined in the visualisation ontology will inherit from it.
PublicationSlot: each attribute that is going to appear in the final application should inherit from this concept.
PublicationInfo: this concept defines the mappings between the visualisation ontology and the domain ontology, specifies how each of the components of the visualisation ontology will be visualized, what its behaviour will be, etc., and also records the user profile to which this information applies.
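The relationship between the three predefined concepts can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the authors' implementation: the class fields, the `render` method and the `run_query` callable (a stand-in for an RDQL or SeRQL engine) are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for an RDQL/SeRQL engine: takes a mapping query and a
# domain-ontology resource, returns the attribute values found for it.
QueryEngine = Callable[[str, str], List[str]]


@dataclass
class PublicationInfo:
    """Mapping and presentation details for one entity or slot (assumed fields)."""
    mapping_query: str       # query evaluated against the domain ontology
    user_profile: str        # profile this presentation is intended for
    context: str = "GRAPH"   # presentation context


@dataclass
class PublicationSlot:
    """An attribute as it appears in the final application."""
    name: str
    info: PublicationInfo


@dataclass
class PublicationEntity:
    """An object as it is published; its values stay in the domain ontology."""
    name: str
    slots: List[PublicationSlot] = field(default_factory=list)

    def render(self, resource: str, profile: str,
               run_query: QueryEngine) -> Dict[str, List[str]]:
        # Only slots whose PublicationInfo matches the user profile are shown;
        # values are fetched on demand rather than copied into this ontology.
        return {
            slot.name: run_query(slot.info.mapping_query, resource)
            for slot in self.slots
            if slot.info.user_profile == profile
        }
```

The point of the sketch is that the view holds only queries and presentation metadata, so several views can be defined over the same domain ontology without duplicating its content.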


Figure 2 Decoupled publishing of domain ontologies using visualisation ontologies

With this visualisation ontology we must specify the navigation philosophy, the visualisation aspect (2D, 3D, shapes, etc.), and the interaction with the visualised objects. All these features will depend on the user profile, on the amount of knowledge actually stored in the domain ontology, on the type of knowledge being visualised, on the use by other applications, etc. With respect to the different aspects to be used according to the type of knowledge presented to the user, Figure 3 shows an example of two of the possible graphical presentations that can be used: graphs and art galleries.

An example of this approach is to visualise the information related to an Author ("Persona"), their publications, or their work at different periods of time. All of this information is distributed across the domain ontology, in different concepts that represent the publications and organizations, and in other concepts that express binary relations between concepts. In order to visualize all information related to an Author, we can define a PublicationEntity (see Figure 4). Figure 4 also shows all the PublicationSlots (attributes of Author) that we want to visualize in the final application.

Figure 3 Different visualisation contexts according to the type of knowledge presented


Figure 4 Example of PublicationEntity

Figure 5 shows some examples of PublicationSlots: the author's name, his/her photograph ("Foto"), or where he/she studied ("Estudio En"). Those attributes can come from several concepts that represent binary relations with specific properties, so it is also necessary to define mappings (written in a specific query language) to the domain ontology; an example of this is the PublicationSlot Publications (Obras Publicadas). Therefore each PublicationEntity and PublicationSlot has an associated PublicationInfo, where the user profile, the information about behaviour and geometry, and the mapping to the domain ontology are defined. An example of the mapping for the concept Author publications (Obras Publicadas) is:

SELECT DISTINCT resource, obra, r_obra
FROM {r} <rdf:type> {<K:Relacion Creacion>};
     [<K:agente_responsable> {resource}];
     [<K:creacion_relacionada> {r_obra} <K:referencia> {obra}]
WHERE resource like #?resource#
USING NAMESPACE K = <!...>

Figure 5 Examples of PublicationSlot

Figure 6 shows some examples of the results in the final application: an author and its publications rendered as a 3D graph, taking into account the information provided in the visualisation ontology. Finally, several PublicationInfo instances can be associated with a PublicationEntity or PublicationSlot, each defining a different Context (a class that defines the scene and contains specific interaction and behaviours). For instance, we can associate one PublicationInfo with the GRAPH context and another with the HALL-LIBRARY context (as shown in Figure 7 for the same information shown in the previous figures). This is useful to customise the visualisation for different types of users with different skills (expert, non-expert, etc.).

Figure 6 Visualisation of an author (left) and of the author and its publications (right) using the GRAPH context and the cube geometry

Figure 7 Example of HALL-LIBRARY context, showing the same publications as those in Figure 6, visualised as books in a library

5. Knowledge generation

Natural language generation (NLG) takes structured data in a knowledge base as input and produces natural language text, tailored to the presentational context and the target reader (Reiter and Dale, 2000). NLG techniques use and build models of the context and the user and use them to select appropriate presentation strategies, e.g. to deliver short summaries to the user's WAP phone or a longer multimodal text if the user is using their desktop.

In the context of the semantic web or knowledge management, NLG is required to provide automated documentation of ontologies and knowledge bases. Unlike human-written texts, an automatic approach will constantly keep the text up to date, which is vitally important in the semantic web context, where knowledge is dynamic and is updated frequently. The NLG approach also allows generation in multiple languages without the need for human or automatic translation (see Aguado et al., 1998).

Generation of natural language text from ontologies is an important problem, firstly because textual documentation is more readable than the corresponding formal notations and thus helps users who are not knowledge engineers to understand and use ontologies. Secondly, a number of applications have now started using ontologies internally to encode knowledge and reason with it, but this formal knowledge also needs to be expressed in natural language in order to produce reports, letters, etc. In other words, NLG can be used to present structured information in a user-friendly way.

There are several advantages to using NLG rather than fixed templates into which the query results are filled:
NLG can use different sentence structures depending on the number of query results, e.g. conjunction versus itemised list;
depending on the user's profile of their interests, NLG can include different types of information: affiliations, addresses, publication lists, indications on collaborations (derived from project information); and
given this variety of what information from the ontology can be included and how it can be presented, depending on its type and amount, writing templates will be unfeasible because there will be too many combinations to be covered.
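The first advantage, choosing a sentence structure from the number of query results, can be illustrated with a minimal sketch. The function name `verbalise`, the `max_inline` threshold of three and the output wording are assumptions made for illustration; they are not taken from any of the systems discussed here.

```python
from typing import List


def verbalise(subject: str, relation: str, values: List[str],
              max_inline: int = 3) -> str:
    """Pick a sentence structure according to how many results there are."""
    if not values:
        return ""  # nothing to say about this relation
    if len(values) == 1:
        return f"{subject} {relation} {values[0]}."
    if len(values) <= max_inline:
        # Few results: join them into a single conjoined sentence.
        joined = ", ".join(values[:-1]) + " and " + values[-1]
        return f"{subject} {relation} {joined}."
    # Many results: fall back to an itemised list.
    items = "\n".join(f"- {value}" for value in values)
    return f"{subject} {relation}:\n{items}"
```

A fixed template would need one variant per case; here the structure is decided at generation time, which is the property the list above attributes to NLG.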
The variety of possible presentations comes from the fact that each user of the system has a profile comprising user-supplied (or system-derived) personal information (name, contact details, experience, projects worked on), plus information derived semi-automatically from the user's interaction with other applications. Therefore, there will be a need to tailor the generated presentations according to the user's profile.

NLG systems that are specifically targeted towards semantic web ontologies have started to emerge only recently. For example, there are some general purpose ontology verbalisers for RDF and DAML+OIL (Wilcock and Jokinen, 2003) and OWL (Wilcock, 2003). They are based on templates and follow closely the ontology constructs, e.g.: "This is a description of John Smith identified by ... His given name is John ..." (Wilcock, 2003). The advantage of Wilcock's approach (Wilcock and Jokinen, 2003; Wilcock, 2003) is that it is fully automatic and does not require a lexicon. A more recent system which generates reports from RDF and DAML ontologies is MIAKT (Bontcheva and Wilks, 2004). In contrast to Wilcock's approach, MIAKT requires some manual input (lexicons and domain schemas), but on the other hand it generates more fluent reports, oriented towards end-users, not ontology builders. It also uses reasoning and the property hierarchy to avoid repetitions, enable more generic text schemas, and perform aggregation.

Our work extends the MIAKT approach towards making it less domain dependent and easier to configure by non-NLG experts. A novel dimension is the focus on tailoring the summary formatting and length according to a device profile (e.g. mobile phone, web browser). Another innovative idea is the use of ontology mapping for summary generation from different ontologies.

Summary generation in our system (called ONTOSUM) starts off by being given a set of statements (i.e. triples), in the form of RDF/OWL.
Since there is some repetition, these triples are first pre-processed to remove facts that have already been stated. In addition to triples that have the same property and arguments, the system also removes triples involving inverse properties with the same arguments as those of an already verbalised one. The information about inverse properties is provided by the ontology (if supported by the representation formalism). An example summary is shown in Figure 8.

The lexicalisations of concepts and properties in the ontology can be specified by the ontology engineer, be taken to be the same as the concept names themselves, or added manually as part of the customisation process. For instance, the AKT ontology[11] provides label statements for some of its concepts and instances, which are found and imported into the lexicon automatically. ONTOSUM is parameterised at run time by specifying which properties are to be used for building the lexicon.

Summary structuring is done using discourse/text schemas (Reiter and Dale, 2000), which are script-like structures that represent discourse patterns. They can be applied recursively to generate coherent multisentential text. In more concrete terms, when given a set of statements about a given concept/instance, discourse schemas are used to impose an order on them, such that the resulting summary is coherent. For the purposes of our system, a coherent summary is a summary where similar statements are grouped together.

The schemas are independent of the concrete domain and rely only on a core set of four basic properties: active-action, passive-action, attribute, and part-whole. When a new ontology is connected to ONTOSUM, properties can be defined as sub-properties of one of these four generic ones, and ONTOSUM will then be able to verbalise them without any modifications to the discourse schemas. However, if more specialised treatment of some properties is required, it is possible to enhance the schema library with new patterns that apply only to a specific property.
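The pre-processing and ordering steps described above can be sketched as follows. This is a simplified illustration rather than ONTOSUM's actual code: the function names, the `inverse_of` and `generic_type` dictionaries, and the particular order chosen in `SCHEMA_ORDER` are assumptions; only the four generic property names come from the text.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)

# The four generic properties named in the text; this ordering is assumed.
SCHEMA_ORDER = ["attribute", "part-whole", "active-action", "passive-action"]


def remove_repetitions(triples: List[Triple],
                       inverse_of: Dict[str, str]) -> List[Triple]:
    """Drop exact repeats and inverse restatements of already kept triples."""
    seen = set()
    kept = []
    for subj, prop, obj in triples:
        if (subj, prop, obj) in seen:
            continue  # same property and arguments
        inverse = inverse_of.get(prop)
        if inverse is not None and (obj, inverse, subj) in seen:
            continue  # inverse property with the same arguments
        seen.add((subj, prop, obj))
        kept.append((subj, prop, obj))
    return kept


def order_by_schema(triples: List[Triple],
                    generic_type: Dict[str, str]) -> List[Triple]:
    """Group similar statements together by their generic super-property."""
    rank = {name: position for position, name in enumerate(SCHEMA_ORDER)}
    # sorted() is stable, so triples of the same type keep their input order.
    return sorted(triples, key=lambda triple: rank[generic_type[triple[1]]])
```

In this reading, connecting a new ontology amounts to supplying the `generic_type` mapping for its properties, which mirrors the sub-property declarations described in the text.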
Next ONTOSUM performs semantic aggregation, i.e., it joins RDF statements with the same property name and domain as one conceptual graph. Without this aggregation step, there would be three separate sentences instead of one bullet list (see Figure 8), resulting in a less coherent text. Finally, ONTOSUM verbalises the statements using the HYLITE+ surface realiser. The output is a textual summary. The overall system architecture is shown in Figure 9 and further details can be found in Bontcheva (2005).

Figure 8 Example of a generated summary

An innovative aspect of ONTOSUM, in comparison to previous NLG systems for the semantic web, is that it implements tailoring/personalisation based on information from the user's device profile. More specifically, we developed methods for generating summaries within a given length restriction (e.g. 160 characters for mobile phones) and in different formats: HTML for browsers and plain text for e-mails and mobile phones (Bontcheva, 2005). The following section discusses a complementary approach to device independent knowledge access, and future work will focus on combining the two.

Figure 9 Knowledge generation architecture

Another novel feature of ONTOSUM is its use of ontology mapping rules (de Bruijn et al., 2004) to enable users to run the system on new ontologies without any customisation effort.

6. Device independence

An increasingly important and frequent requirement for knowledge access tools is that they are accessible via any web-enabled device. This includes PCs, PDAs, mobile phones, and speech processing devices. In order to meet this objective we have developed a device independent web application framework (DIWAF), written as a Java servlet, which has been designed to support the creation of device independent web sites.

The aim of device independence is to provide an effective user interface to a web application for devices with widely varying capabilities, without having to write a separate site for each class of devices. The problem can be broken down into the following steps:
identify the capabilities of the current device, taking user preferences into account;
select suitable content; and
adapt the content to the target device.

In the DIWAF prototype, device characteristics are handled using the CC/PP standard, and content is adapted to its target device by user-defined templates. Content selection may be carried out either by the application, or by placing conditions on templates. These processes are described in more detail in the following sections.

6.1 Identifying device capabilities

There is no universally accepted and supported method of communicating device requirements to the server. However, the most promising standard is Composite Capability/Preference Profiles (CC/PP)[12], a W3C recommendation. In this standard, the sending device extends each HTTP request with a reference to a default device profile and, optionally, a set of over-rides, or profile-diffs.
The default profile reference takes the form of the URL of an RDF document describing the device. In the current implementation of the DIWAF, CC/PP profile information is handled by a standard open source Java implementation produced by the Java Community Process, led by Sun Microsystems. Aspects of the DELI system (Butler, 2002) are used to handle default behaviour if the request contains no CC/PP header. The profile information is made available to the servlet as a collection of attributes, such as screen size, browser name, etc. These attributes can be used to inform the subsequent selection and adaptation of content.

However, CC/PP by itself is not sufficient to meet all the requirements. For example, one requirement is to push information to users when new documents in their domain of interest become available, using e-mail, SMS or WAP push technology. Since these messages are not a response to an HTTP request, CC/PP cannot be used, and the server must maintain a user profile describing the devices held by each registered user.

6.2 Selecting suitable content

Because different devices have different capabilities it is often necessary to select different content. For example, images should not be sent to a device that cannot display them; short or abbreviated descriptions may be best for small screen devices, with a fuller description available for larger screens, and so on. In many cases the DIWAF software can do this selection automatically, or semi-automatically as described below. In other cases, more sophisticated techniques might be appropriate. For example, the output of a natural language processing engine might be adapted according to the required text length. In these cases, the DIWAF engine must pass profile information on to the client software generating the data.

It is worth noting that the device profiles sometimes need to be interpreted to provide an adequate description. For example, the number of characters that can fit on one line depends on both the physical screen size and the font size chosen by the user. It is useful to allow author-defined extensions to the profile based on combinations of standard attributes, known as capability classes.
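A capability class can be thought of as a named predicate over the raw profile attributes. The sketch below illustrates this idea and the content selection it enables; it is not DIWAF code (DIWAF is a Java servlet), and the attribute names (`screen_width_px`, `font_width_px`, `image_capable`), the class names and the 40-character threshold are hypothetical, not real CC/PP vocabulary.

```python
from typing import Callable, Dict, Set

Profile = Dict[str, object]


def chars_per_line(profile: Profile) -> int:
    """Derived attribute: depends on both screen width and chosen font size."""
    return int(profile["screen_width_px"]) // int(profile["font_width_px"])


# Author-defined capability classes: combinations of standard attributes.
CAPABILITY_CLASSES: Dict[str, Callable[[Profile], bool]] = {
    "small_screen": lambda p: chars_per_line(p) < 40,  # assumed threshold
    "shows_images": lambda p: bool(p["image_capable"]),
}


def capabilities(profile: Profile) -> Set[str]:
    """Names of all capability classes that the device profile satisfies."""
    return {name for name, test in CAPABILITY_CLASSES.items() if test(profile)}


def select_content(item: Dict[str, str], profile: Profile) -> Dict[str, str]:
    """Choose the short or full description and include images only if usable."""
    caps = capabilities(profile)
    selected = {"text": item["short"] if "small_screen" in caps else item["full"]}
    if "shows_images" in caps and "image" in item:
        selected["image"] = item["image"]
    return selected
```

Keeping the interpretation in one place means templates and applications can test for `small_screen` rather than repeating the arithmetic on screen and font sizes.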
6.3 Adapting the content

The hardest problem in achieving device independence is adapting the selected content to the current device. The output must be in a suitable language, and must specify the required geometrical layout and style. In principle, artificial intelligence techniques could be used to understand the information to be presented and construct a suitable representation of it on the fly, possibly under the guidance of style rules. However, this is still a matter for research, and for the current prototype, simpler techniques are used.

One thrust of research is to add metadata to the content describing its structure. For example, the metadata may be used to label headings and subheadings, input controls, and blocks of text. Different software drivers can then use this information to create an effective layout. HTML is a good example of this approach, and it has proved successful for many years. Technology such as CSS Media Queries (which allow selection of different elements based on CC/PP characteristics), and new small screen rendering technology, together with extensions to HTML such as XForms, promise to push the boundaries further. The main objection to this approach is that, in order to make use of the full capabilities of each device, each device needs to be given different metadata. A page that has been carefully constructed to look effective on a PC will not necessarily be a good starting point for speech generation.
