Semantic Indexing of Algorithms Courses Based on a New Ontology

EL Guemmat Kamal 1, Benlahmer Elhabib 2, Talea Mohamed 1, Chara Aziz 2, Rachdi Mohamed 2

1 Université Hassan II - Mohammedia Casablanca, Faculté des Sciences Ben M'sik, Laboratoire de Traitement de l'Information, Cdt Driss El Harti, BP 7955 Sidi Othman Casablanca, Maroc
k.elguemmat@gmail.com, taleamohamed@yahoo.fr

2 Université Hassan II - Mohammedia Casablanca, Faculté des Sciences Ben M'sik, Laboratoire de Technologie de l'Information et Modélisation, Cdt Driss El Harti, BP 7955 Sidi Othman Casablanca, Maroc
h.benlahmer@gmail.com, aziz.chara@hotmail.fr, mohamed.rachdi@yahoo.fr

Abstract: Since the earliest work on indexing, documents and queries have been represented by keywords drawn from their content. Using words to represent document and query content raises several problems, notably the ambiguity of words and their disparity. Semantic indexing is a solution to these problems: its goal is to index by the meaning of words rather than by the words themselves. Where ambiguity is present, semantic indexing is meant to improve the performance of an IRS (Information Retrieval System). In this way we aim to overcome the problems of traditional indexing approaches. We propose a new approach for semantically indexing algorithms courses written in French, based on a new application ontology. The aim of our approach is to couple a semantic annotation tool with the reference ontology. The annotation tool then generates an index that can be used in e-learning as needed (question answering systems, information retrieval systems...) while improving performance in the field.

Keywords: semantic indexing; algorithms courses; French language; ontology; e-learning.

1. Introduction

The challenge in e-learning is to make the learning task easier for the actors involved (learners, trainers...).
We continue in this direction by defining a method for the semantic indexing of algorithms courses written in French, to be used in e-learning. However, this area faces several limitations. Several indexing approaches exist, but the question is how to adopt one within e-learning through a transparent, fluid method adapted to the teaching of algorithms.

The field of information retrieval dates back to the early 1950s [1]. Its goal is to find, in a large database, the documents (a text, a piece of text, a web page, an image, a video) that are relevant to a user request, that is, to the expression of the user's information need: the user must be able to find the information he or she needs. The first problem that interested researchers was how to index documents so that they can be found. This research has made much progress, culminating in the arrival of semantic indexing.

Our goal is to take advantage of semantic indexing to overcome the limitations of traditional indexing of algorithms courses. We rely on a new ontology, which we call OntAlgO, and on an approach that determines the document to be indexed, identifies its key concepts and finally generates an index that characterizes the meaning of the course. This index can then be exploited in e-learning as needed (question answering systems, information retrieval systems...).

This article is organized as follows. Section 2 defines the issues of indexing, identifies the limits of traditional indexing and describes the move to semantic indexing. Section 3 presents our contribution: our ontology (OntAlgO) and our approach to semantically indexing algorithms courses. Section 4 presents the conclusion and perspectives.

2. The Challenges of Indexing

The purpose of indexing is to find representatives of the most important concept or concepts in a document,
so that an information retrieval system can reuse these representations to compare the query and the document more easily [2]. Among the most important approaches, classic indexing is explained with its strengths and weaknesses in Section 2.1, and the move to semantic indexing is presented in Section 2.2.

2.1. Classic Indexing

Indexing can be done in different ways:
Manual: the document is analyzed by a human expert in the field.
Automatic: the process is fully automated.
Semi-automatic: the process relies primarily on the automatic mode, but the final choice of significant terms remains with the expert in the field.

The automatic mode has the benefit of automating the whole indexing process. It applies several treatments to the documents: automatic extraction of descriptors, use of an anti-dictionary to remove function words, stemming, identification of groups of words, and weighting of words before the index is created. The resulting set of weighted terms forms a representation of the content of the document. How these terms are organized depends on the IR (Information Retrieval) model in use (Boolean, vector or probabilistic models) [3].

Among the problems that confront traditional indexing are the ambiguity of words and their disparity. They arise because the textual entities that represent documents and queries are specified by keywords from the content [4]:
The ambiguity of words, called lexical ambiguity, refers to words that are lexically identical but have different grammatical functions or meanings. It is generally divided into two types: syntactic ambiguity and semantic ambiguity.
The disparity of words (word mismatch) refers to lexically different words that have the same function, as synonyms do.
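As an illustration of the automatic treatments listed in Section 2.1, the following minimal Python sketch chains an anti-dictionary, a stemming step and a weighting step. The stop list, the suffix-stripping rule and the relative-frequency weighting are all assumptions made for this example; they stand in for whatever resources and IR model a real system would use.

```python
import re
from collections import Counter

# Hypothetical anti-dictionary (stop list); a real one would be much larger.
STOP_WORDS = {"une", "est", "que", "le", "la", "les", "par", "de", "il", "va", "son", "au", "pour"}


def crude_stem(word: str) -> str:
    """Rough stand-in for stemming: strip a final plural 's' or 'x' from longer words."""
    if len(word) > 3 and word[-1] in "sx":
        return word[:-1]
    return word


def classic_index(text: str) -> dict:
    """Return the stems of the text, each weighted by its relative frequency."""
    tokens = re.findall(r"\w+", text.lower())                         # descriptor extraction
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]    # anti-dictionary + stemming
    counts = Counter(stems)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}           # weighting


index = classic_index("Une variable est une boîte, les variables")
print(index)
```

Such an index records only surface forms: it cannot tell two senses of the same word apart, nor recognize that two different words are equivalent, which is exactly where the ambiguity and disparity problems appear.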
Various solutions have been proposed to overcome the limitations of traditional indexing:
For the ambiguity of words, one solution is to use compound expressions ([5], [6]) to reduce ambiguity. Yet it is not always possible to provide, in the query, a compound expression that conveys the desired sense, and formulating such expressions requires great effort from the user.
For the disparity of words, one solution is to expand the query using a thesaurus of synonyms [7]. To extend a word in the query with its synonyms, we must know not only the sense of the word in the query but also the sense of the word used to extend it [8].

Within IR, a new type of indexing, called semantic indexing, has appeared to overcome the limitations of traditional indexing. It is explained in the next section.

2.2. Semantic Indexing

Semantic indexing brings improvements to the representation of documents and queries. It is a specialization of traditional indexing: according to [3], the goal is to index by the meaning of words rather than by the words themselves. Where ambiguity is present, semantic indexing is meant to improve the performance of an IRS.

Semantic indexing involves two main phases [4]:
Disambiguation phase: find the correct meaning of each word in the document (respectively, the query).
Representation phase: represent the document (respectively, the query).

There are several approaches to disambiguation. Some rely on a training corpus to compute the correct meaning of a word. Others rely on the local context and on definitions from external linguistic resources such as machine-readable dictionaries (MRD), thesauri, ontologies, or a combination of them. Among the approaches to representation, we find either a representation based on senses alone or a combined keyword/sense representation.
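To make the disambiguation phase more concrete, here is a minimal Python sketch of one context-based strategy: choosing the sense whose dictionary definition shares the most words with the local context (a simplified Lesk-style overlap, used here as a stand-in and not taken from the cited works). The toy dictionary `TOY_MRD`, its sense identifiers and its glosses are invented for the example.

```python
import re

# Hypothetical machine-readable dictionary: each sense of a word has a short gloss.
TOY_MRD = {
    "boucle": {
        "boucle_algo": "structure qui répète une suite d'instructions dans un programme",
        "boucle_objet": "anneau de métal ou de cheveux",
    }
}


def words(text):
    """Lower-cased set of the words in a text."""
    return set(re.findall(r"\w+", text.lower()))


def disambiguate(word, context, mrd=TOY_MRD):
    """Pick the sense of `word` whose gloss overlaps most with the context words."""
    context_words = words(context)
    senses = mrd[word]
    return max(senses, key=lambda sense: len(words(senses[sense]) & context_words))


print(disambiguate("boucle", "la boucle répète les instructions du programme"))
```

The representation phase would then store the selected sense identifier in the index, either alone or alongside the original keyword.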
Since the 1990s, ontologies have become a research subject at the heart of different communities, including artificial intelligence, the semantic web, software engineering, biomedical informatics and information architecture. One reason for this popularity is that an ontology is a controlled and organized vocabulary that explicitly formalizes the relations between its terms. Another is that it offers a common, shared understanding of a domain, both for human users and for software applications [9]. In this respect, our contribution, presented in the next section, benefits from the advantages of semantic indexing, and more precisely from those of the ontology-based approach, for indexing algorithms courses to be used in e-learning.
3. Our Contribution in the Field

Our contribution focuses on the semantic indexing of e-learning resources (courses on algorithms). Our solution relies on an ontology-based approach to semantically index algorithms courses written in French. Section 3.1 presents the new ontology, OntAlgO, and how it was created. Section 3.2 discusses the approach we adopted for the semantic indexing of algorithms courses.

3.1. Construction of the Ontology OntAlgO

Dividing up the knowledge of a teaching domain serves to classify the knowledge to be taught. Ontologies make this possible and play a crucial role: they model knowledge through concepts, attributes and relationships, which are then used to index the content of documents.

Division of knowledge: our teaching field (algorithms) is organized around the following concepts:
CoA (Concepts of Application): the key concepts that model the algorithms courses.
CoU (Concepts of Use): surface markers that model a set of knowledge elements describing the use cases of a CoA.

Our ontology OntAlgO is obtained by classifying the knowledge about algorithms extracted from the course "ALGORITHMIQUE ET PROGRAMMATION NON-MATHEUX COURS COMPLET avec exercices, corrigés et citations philosophiques" (footnote 1). It comprises seven CoA (variable, test, boucle, tableau, fonction, procédure, tri) and four CoU (définition, syntaxe, types, exemple).

Figure 1 shows our ontology OntAlgO implemented with the Protégé editor (footnote 2), where we present the ontology concepts.

Figure 1: OntAlgO implemented with Protégé.

The language chosen to define OntAlgO is OWL, recommended by the W3C (footnote 3) in February 2004. It is more expressive than the other W3C languages, RDF and RDFS, and offers features that they do not define. The objective of developing the ontology is to implement it in an annotation tool.
The formal description of our ontology is designed to prepare its integration into the annotation tool.

Attributes and relationships of OntAlgO

Every CoA concept of the ontology has an attribute nom. Every CoU concept has an attribute marqueur. The semantic relationships between CoA and CoU are defined by associative relations (cf. Figure 2). For example, the relationship between the CoA variable and the CoU définition is déf de var.

Figure 2: Relation between CoA and CoU.

1 http://www.pise.info/algo/index.htm
2 http://protege.stanford.edu/
3 http://www.w3c.org
Instantiating OntAlgO

The last step is to provide instances of the classes in the hierarchy. Table 1 gives an example for the CoA variable and the CoU définition.

Table 1: Example of OntAlgO instantiation.

Concept         Attribute   Value
variable 1      nom         variable
définition 1    marqueur    est une
définition 2    marqueur    sont
définition 3    marqueur    on définit
définition 4    marqueur    on utilise
définition 5    marqueur    permet
définition 6    marqueur    nom de
définition 7    marqueur    un ensemble de

Every CoA takes as the value of its attribute nom the name used to describe the concept. Each CoU has several instances, one per marker, to identify the use cases of a CoA.

3.2. Developed Approach

The proposed approach, called indexing algorithms courses, aims to improve the relevance of IR. It introduces a new method for disambiguating the semantic descriptors contained in the courses by means of OntAlgO.

Approach to semantic indexing of algorithms courses

Our approach follows this process (cf. Figure 3):
First, we identify the algorithms course to index.
An annotation tool processes the document: it marks the CoA first, then the CoU, both taken from the reference ontology OntAlgO, in the form <CoU> <CoA> </CoU>. The span of a CoU is limited to the sentence (the text between two full stops) or ends when a new CoA appears.
The annotation tool identifies the semantic relationship between the CoU and the CoA in OntAlgO.
The index is generated.

Figure 3: Approach of semantic indexing of algorithms courses.

Example of semantic indexing of an extract from the course on algorithms

Extract from the course on algorithms to index: «Pour employer une image, une variable est une boîte, que le programme (l'ordinateur) va repérer par une étiquette. Pour avoir accès au contenu de la boîte, il suffit de la désigner par son étiquette.» (In English: "To use an image: a variable is a box that the program (the computer) will locate by means of a label. To access the content of the box, it is enough to refer to it by its label.")

Identification of the CoA, the CoU and the relationship between them by the annotation tool: CoA: variable, CoU: définition, relationship: déf de var.
<définition>Pour employer une image, une <variable>variable</variable> est une boîte, que le programme (l'ordinateur) va repérer par une étiquette</définition>. Pour avoir accès au contenu de la boîte, il suffit de la désigner par son étiquette.
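The annotation step illustrated on this extract can be sketched in a few lines of Python. This is only a hedged illustration under stated assumptions, not the annotation tool itself: the ontology is reduced to an in-memory excerpt (the CoA variable, the définition markers of Table 1 and the single relation label given in the text), the function names `find_pair` and `annotate` are hypothetical, and the rule that a new CoA closes the current CoU span is left out for brevity.

```python
import re

# Hypothetical in-memory excerpt of OntAlgO (not the OWL file itself).
COA = ["variable"]
COU_MARKERS = {
    "définition": ["est une", "sont", "on définit", "on utilise",
                   "permet", "nom de", "un ensemble de"],
}
# Only the relation label named in the text is listed here.
RELATIONS = {("définition", "variable"): "déf de var"}


def find_pair(sentence):
    """Return (cou, coa) if a CoA is followed by a CoU marker in the sentence, else None."""
    for coa in COA:
        hit = re.search(rf"\b{re.escape(coa)}\b", sentence, flags=re.IGNORECASE)
        if hit is None:
            continue
        rest = sentence[hit.end():]
        for cou, markers in COU_MARKERS.items():
            for marker in markers:
                if re.search(rf"\b{re.escape(marker)}\b", rest, flags=re.IGNORECASE):
                    return cou, coa
    return None


def annotate(text):
    """Return (annotated_text, index) for a course extract."""
    annotated, index = [], []
    # A CoU span is limited to one sentence, delimited by full stops.
    for sentence in re.split(r"(?<=\.)\s+", text.strip()):
        body = sentence.rstrip(".")
        pair = find_pair(body)
        if pair is None:
            annotated.append(sentence)
            continue
        cou, coa = pair
        # Mark the first occurrence of the CoA, then wrap the sentence in the CoU.
        tagged = re.sub(rf"\b{re.escape(coa)}\b",
                        lambda m: f"<{coa}>{m.group(0)}</{coa}>",
                        body, count=1, flags=re.IGNORECASE)
        annotated.append(f"<{cou}>{tagged}</{cou}>.")
        index.append({"coa": coa, "cou": cou, "relation": RELATIONS.get((cou, coa))})
    return " ".join(annotated), index


extract = ("Pour employer une image, une variable est une boîte, que le programme "
           "(l'ordinateur) va repérer par une étiquette. Pour avoir accès au contenu "
           "de la boîte, il suffit de la désigner par son étiquette.")
annotated_text, index = annotate(extract)
print(annotated_text)
print(index)
```

On the extract above, the sketch wraps the first sentence in a définition span, tags variable inside it, leaves the second sentence untouched and records one index entry carrying the relation déf de var.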
Finally, we obtain the index (cf. Figure 4), generated by the annotation tool based on OntAlgO.

Figure 4: Example of index.

4. Conclusion and Outlook

Our contribution, the semantic indexing of algorithms courses, yields a consistent index. It comprises:
The design of a new application ontology, OntAlgO.
The development of an approach to index algorithms courses.

Our result opens several perspectives for exploiting the indexed courses:
Searching for courses.
Searching for complementary documentation.

5. References

[1] Mooers, C. N. Application of Random Codes to the Gathering of Statistical Information. Master's thesis, MIT, 1948.
[2] Roussey, C. Une méthode d'indexation sémantique adaptée aux corpus multilingues. Doctoral thesis, Institut national des sciences appliquées de Lyon, 2001.
[3] Baziz, M. Indexation conceptuelle guidée par ontologie pour la recherche d'information. Doctoral thesis, Institut de Recherche en Informatique de Toulouse, 2005.
[4] Boubekeur-Amirouche, F. Contribution à la définition de modèles de recherche d'information flexibles basés sur les CP-Nets. Doctoral thesis, Université de Toulouse, 2008.
[5] Fagan, J. L. Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-syntactic Methods. PhD thesis, Dept. of Computer Science, Cornell University, Sept. 1987.
[6] Salton, G. Syntactic approaches to automatic book indexing. In Proc. of the annual meeting on Association for Computational Linguistics (ACL) (1988), Department of Computer Science, Cornell University, Ithaca, New York, pp. 204-210.
[7] Salton, G., Fox, E., and Wu, H. Extended Boolean information retrieval. Communications of the ACM, 26(12), 1983.
[8] Krovetz, R. and Croft, W. B. Lexical Ambiguity and Information Retrieval. ACM Transactions on Information Systems, Vol. 10, No. 2, pp. 115-141, April 1992.
[9] Amardeilh, F. Web Sémantique et Informatique Linguistique : propositions méthodologiques et réalisation d'une plateforme logicielle. Doctoral thesis, Université Paris X Nanterre, 2007.