PLATFORM OF TRANSCRIPTION THE OLD ARABIC MANUSCRIPTS

Noureddine EL MAKHFI
Laboratoire de Transmission et Traitement de l'Information (LTTI), UFR: Signaux Systèmes et Télécommunications, Faculty of Sciences and Technology, University of Sidi Mohamed Ben Abdellah, Fez, Morocco
n.elmakhfi@gmail.com

Rachid BENSLIMANE
Laboratoire de Transmission et Traitement de l'Information (LTTI), University of Sidi Mohamed Ben Abdellah, Fez, Morocco
r.benslimane1@gmail.com

Abstract: Old manuscripts kept in libraries are among the richest parts of the cultural heritage and legacy of civilizations. Because these documents are very difficult for users to handle, digitization is a solution for preserving this cultural and historical heritage. Access to national heritage manuscripts is restricted because physical handling accelerates their degradation. To widen access while preserving the originals, the widely adopted solution relies partly on digitizing the manuscripts and partly on developing platforms to manage and disseminate the digitized material. In this paper we propose a platform for the transcription and establishment of manuscripts by annotating their images; the annotations conform to an XML model. Search within the images of a handwritten document, rich functionality, an intuitive user interface, portability, extensibility and the power of XML technology make this platform an ideal explorer for specialists and readers of ancient Arabic manuscripts.

Keywords: Establishment; Transcription; Manuscripts; Digitalization; Arabic search engine; Metadata; Annotation.

1. Introduction

Searching for information in Arabic manuscripts is not a simple process, owing to several constraints: the cursive nature of Arabic script, the massive presence of diacritical symbols, and overlapping between words and lines in Arabic manuscripts.

Many research works and projects have been devoted to the processing of ancient and modern manuscripts [1]:

The BAMBI (Better Access to Manuscripts and Browsing of Images) project [2] produced a hypermedia system allowing historians, and more particularly codicologists and philologists, to read manuscripts, transcribe them, write annotations, and navigate between the words of the transcription and the matching region in the digitized image of the manuscript.

The DEBORA (Digital access to Books Of RenAissance) project [3] sought to define the needs related to the use of virtual libraries and electronic books. Its aim was to develop tools for digitizing and accessing a selection of 16th-century books.

The EAMMS (Electronic Access to Medieval Manuscripts) project [4] developed guidelines for encoding and storing catalog descriptions of medieval and Renaissance manuscripts in electronic form.

The MASTER (Manuscript Access Through Standards for Electronic Records) project [5] defined a standard that is generic, robust and flexible enough to be applied in different areas of manuscript description. The chosen technology is based on the international standards SGML (Standard Generalized Markup Language) and XML (Extensible Markup Language).

Access to heritage documents, including old Arabic manuscripts, involves creating a database of images of these manuscripts. Building such a database requires a scanning operation followed by the creation of metadata specific to these manuscripts. Our work in this field aims to facilitate information retrieval by identifying Arabic manuscripts through metadata and annotations on the page images.

ISSN : 0975-5462 Vol. 3 No. 6 June 2011 5358
In a recent work [6], entitled "Searching in Arab Manuscripts Using Metadata and Annotation", we developed a method for handling old manuscripts based on the SDX platform [7]. Its drawback is the manual creation of the XML files associated with the manuscripts: generating these files requires the collaborative effort of several experts to annotate, transcribe, and establish the content of the old documents before they are published on the web. To overcome this drawback, we propose in this paper a new platform that facilitates the transcription and establishment of old Arabic handwritten text in the form of annotations conforming to an XML model, so that rare manuscripts can be made available.

2. Structure of handwritten documents

2.1. Metadata

Librarians have long compiled indexes to describe existing documents. Such records are data used to qualify other data, such as book contents; that is why they are called metadata. They encode essential information about a document in a clear fashion: title, author, date of publication, keywords, etc. Table 1 summarizes the categories of metadata used in our work, as reported in [8].

Table 1. Metadata.

N   Category                     Adopted metadata
1   Author                       Copyist; name of possessor
2   Title                        Title of manuscripts; title of chapters; title of sub-chapters
3   Studied period               Islamic Jerusalem; pre-Islamic; IV-X centuries; VI-VII centuries; medieval Islamic (VII-XV); all periods
4   Category of manuscripts      Arabo-Christian; Arabo-Islamic
5   Material study of document   Types of paper; dating of paper
6   Type of manuscript           Koran; other religious texts; scientific; medical; literary; Muslim juridical; philosophical; historical; etc.
7   Study of the writing         Style
8   Decoration of the texts      Illuminations; illustrations; frontispiece

2.2. Schema Diagram

2.2.1. Root Element

The root element of the XML file is called msdescription. It may appear either within the body or within the header of a TEI (Text Encoding Initiative) [9] conformant document. In the former case, where the document being encoded is essentially a collection of manuscript descriptions, the <msdescription> element may be placed anywhere a paragraph might appear. The latter case applies where the description forms part of the metadata associated with a digital representation of a manuscript original, whether as a transcription, as a collection of digital images, or as some combination of the two. The <msdescription> element has the following components, of which only the first is mandatory: seven elements, each with its own complex type.
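As an illustration only, the skeleton of such a record can be sketched in Python with the standard xml.etree.ElementTree module. The seven child names are taken from the subsection headings of this paper (2.2.2 to 2.2.8), and their order is assumed to follow those subsections; the internal structure of each child appears only in the figures, so the children are left empty here. The helper names build_msdescription and missing_mandatory are hypothetical and are not part of the platform.

```python
import xml.etree.ElementTree as ET

# Child element names come from the section headings of the paper.
# Their internal structure is not specified in the text, so each child stays empty.
COMPONENTS = [
    "msidentifier", "physdesc", "history", "mscontents",
    "logicstruct", "admininfo", "additional",
]


def build_msdescription(present=COMPONENTS):
    """Return an <msdescription> element holding the requested children in schema order."""
    root = ET.Element("msdescription")
    for name in COMPONENTS:
        if name in present:
            ET.SubElement(root, name)
    return root


def missing_mandatory(root):
    """Return True when <msidentifier>, assumed to be the only mandatory child, is absent."""
    return root.find("msidentifier") is None


full = build_msdescription()
partial = build_msdescription(present=["physdesc", "history"])
print(ET.tostring(full, encoding="unicode"))
print(missing_mandatory(full), missing_mandatory(partial))
```

A real record would be validated against the full schema; the check above only covers the single mandatory component.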
Fig. 1. Schema diagram for msdescription Element.

2.2.2. Msidentifier Element

This element groups all the elements that identify the manuscript or manuscript fragment.

Fig. 2. Schema diagram for msidentifier Element.

2.2.3. Physdesc Element

Under the general heading of 'physical description' we subsume a large number of different aspects generally regarded as useful in the description of a given manuscript. These include aspects of the form, support, extent, and quire structure of the manuscript object; aspects of the writing...

Fig. 3. Schema diagram for physdesc Element.
2.2.4. History Element

This element groups the elements describing the full history of a manuscript or manuscript part.

Fig. 4. Schema diagram for history Element.

2.2.5. Mscontents Element

The <mscontents> element is used to describe the intellectual content of a manuscript or manuscript part. It comprises either a series of informal prose paragraphs or a series of more structured <msitem> elements, each of which provides a more detailed description of a single item contained within the manuscript.

Fig. 5. Schema diagram for mscontents Element.

2.2.6. Logicstruct Element

This element describes the logical structure of the document: contents, parts, chapters, etc.

Fig. 6. Schema diagram for logicstruct Element.
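The either/or content model of <mscontents> described in Section 2.2.5 can be checked with a few lines of Python. This is a minimal sketch rather than the platform's validation code: the function name contents_mode is hypothetical, and the <msitem> elements hold plain text here, whereas a real description would give them structured children.

```python
import xml.etree.ElementTree as ET


def contents_mode(mscontents):
    """Classify an <mscontents> element as 'prose', 'items', 'empty' or 'mixed'."""
    tags = {child.tag for child in mscontents}
    if not tags:
        return "empty"
    if tags == {"p"}:
        return "prose"      # a series of informal prose paragraphs
    if tags == {"msitem"}:
        return "items"      # a series of structured item descriptions
    return "mixed"          # anything else violates the either/or rule


prose = ET.fromstring("<mscontents><p>Collection of poems.</p></mscontents>")
items = ET.fromstring(
    "<mscontents><msitem>First treatise</msitem><msitem>Second treatise</msitem></mscontents>"
)
mixed = ET.fromstring("<mscontents><p>Note</p><msitem>Treatise</msitem></mscontents>")
print(contents_mode(prose), contents_mode(items), contents_mode(mixed))
```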
2.2.7. Admininfo Element

A variety of information relating to the curation and management of a manuscript may be recorded as simple prose narrative tagged using the standard <p> element.

Fig. 7. Schema diagram for admininfo Element.

2.2.8. Additional Element

This element groups other related information about a manuscript, in particular administrative information relating to its current location, additional materials associated with it, etc.

Fig. 8. Schema diagram for additional Element.

3. Proposed method for indexing and annotating manuscripts

3.1. Principle of the proposed method

The platform offers customizable interfaces for displaying manuscript images and provides searching and browsing functionality for users of Arabic manuscripts. It relies on an XML representation for the metadata, the annotations and the indexing of manuscript pages.

Fig. 9. Digitization, processing, indexing and searching in handwritten documents.
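To make the idea of searching over XML metadata concrete, the following Python sketch builds a small inverted index from a catalogue and answers keyword queries. Everything in it is an assumption made for illustration: the catalogue layout, the element and attribute names (catalogue, manuscript, title, copyist, id) and the sample values are hypothetical. It also strips combining marks, so that a query typed without Arabic short-vowel signs matches a vocalized title; the text does not state whether the platform does this.

```python
import unicodedata
import xml.etree.ElementTree as ET
from collections import defaultdict

VOCALIZED = "\u0643\u0650\u062a\u064e\u0627\u0628"  # the word "kitab" with short-vowel marks
PLAIN = "\u0643\u062a\u0627\u0628"                  # the same word without marks

# Hypothetical catalogue; element and attribute names are illustrative only.
CATALOGUE = f"""
<catalogue>
  <manuscript id="ms1"><title>{VOCALIZED}</title><copyist>Ahmad</copyist></manuscript>
  <manuscript id="ms2"><title>Diwan</title><copyist>Yusuf</copyist></manuscript>
</catalogue>
"""


def normalize(text):
    """Lower-case the text and drop combining marks (Unicode category Mn)."""
    return "".join(ch for ch in text.lower() if unicodedata.category(ch) != "Mn")


def build_index(xml_text):
    """Map each normalized metadata token to the set of manuscript ids containing it."""
    index = defaultdict(set)
    for ms in ET.fromstring(xml_text.strip()).iter("manuscript"):
        for field in ms:
            for token in normalize(field.text or "").split():
                index[token].add(ms.get("id"))
    return index


def search(index, query):
    """Return the ids of manuscripts whose metadata contains every query token."""
    tokens = normalize(query).split()
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))


index = build_index(CATALOGUE)
print(search(index, PLAIN), search(index, "yusuf"))
```

An in-memory dictionary is enough to show the principle; a deployed platform would more plausibly delegate indexing to its search engine.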
3.2. Adding new documents

The page images of the manuscript document to be created are stored in folders; the browse button selects the path to the pages. The form lets the user fill in the values of the metadata fields. The topic selection panel picks a topic, creating it if it does not yet exist, according to the XML schema [6] (library, book title, copyist, support...).

Fig. 10. Adding books according to an XML schema.

3.3. Technique for Image Document Annotations

To allow information retrieval by content, readers can contribute text annotations by selecting areas of the image of a manuscript page [10]. They can also associate links and attach files (pdf, doc, XML, wav, mp3...).

Fig. 11. Schema diagram for annotation.

Fig. 12. Annotation of a page from a handwritten document [11] (annotated pages are shown with different icons).
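The annotation mechanism described above (a selected image area, a text, and optional attached files) can be sketched as a small Python data structure that is written to and read back from XML. The schema of Fig. 11 is not reproduced in the text, so the element and attribute names used here (annotation, region, text, attachment, href) and the class and function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Annotation:
    page: int     # page number within the manuscript
    x: int        # top-left corner of the selected rectangle, in image pixels
    y: int
    width: int
    height: int
    text: str     # transcription or comment attached to the region
    attachments: list = field(default_factory=list)  # linked files or URLs


def to_xml(ann):
    """Serialize one annotation; element and attribute names are hypothetical."""
    node = ET.Element("annotation", page=str(ann.page))
    ET.SubElement(node, "region", x=str(ann.x), y=str(ann.y),
                  width=str(ann.width), height=str(ann.height))
    ET.SubElement(node, "text").text = ann.text
    for target in ann.attachments:
        ET.SubElement(node, "attachment", href=target)
    return node


def from_xml(node):
    """Rebuild an Annotation from an element produced by to_xml."""
    region = node.find("region")
    return Annotation(
        page=int(node.get("page")),
        x=int(region.get("x")), y=int(region.get("y")),
        width=int(region.get("width")), height=int(region.get("height")),
        text=node.findtext("text", default=""),
        attachments=[a.get("href") for a in node.findall("attachment")],
    )


note = Annotation(page=3, x=120, y=45, width=300, height=60,
                  text="chapter heading", attachments=["notes.pdf"])
restored = from_xml(ET.fromstring(ET.tostring(to_xml(note))))
print(restored == note)
```

Storing the rectangle as attributes keeps each region on a single element, which makes it easy to relate a geometric selection to its text.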
3.3.1. Textual annotation of pages

Once created, an annotation can be saved in XML format and then used as supporting data for content-based search in the manuscript.

Fig. 13. Results of a search by textual annotation.

3.3.2. Graphical annotation of pages

We propose to support document search by geometric annotations on the same level as textual annotations. In addition, we have created relationships between the geometric annotations and the text information specific to each type of annotation. This information is saved in XML format to serve as search keywords.

Fig. 14. Graphical interface for annotating a page of a manuscript document [11].

4. Conclusion

In this paper we presented a new platform for the transcription and establishment of Arabic manuscripts by annotating their images; the annotations conform to an XML model. The platform supports search within the images of Arabic handwritten documents using both metadata and annotations. The results obtained could be improved by using word spotting, a semi-automatic annotation method.
References

[1] Stéphane Nicolas, Thierry Paquet, Laurent Heutte, 2003. Digitizing Cultural Heritage Manuscripts: the Bovary Project. In: ACM Symposium on Document Engineering, ACM DocEng 2003, Grenoble, France, pp. 55-57.
[2] Sylvie Calabretto, Andrea Bozzi, Jean-Marie Pinon, December 1999. Numérisation des manuscrits médiévaux : le projet européen BAMBI. In: Actes du colloque Vers une nouvelle érudition : numérisation et recherche en histoire du livre, Rencontres Jacques Cartier, Lyon.
[3] DEBORA: projet européen n. LB 5608 A. Coordinateur R. Bouché, June 2000. 179 pages.
[4] http://www.hmml.org/eamms/index.html
[5] Lou Burnard, Peter Robinson, 1999. Vers un standard européen de description des manuscrits : le projet Master. Document numérique, vol. 3, no. 1-2, pp. 151-169.
[6] O. El Bannay, R. Benslimane, N. El Makhfi, N. Rais, 2009. Searching in Arab Manuscripts Using Metadata and Annotation. European Journal of Scientific Research, ISSN 1450-216X, Vol. 28, No. 1, pp. 155-164. EuroJournals.
[7] http://sdx.archivesdefrance.culture.gouv.fr/gpl/navimages/
[8] Hala Kaileh, 2004. L'accès à distance aux manuscrits arabes numérisés en mode image. Thèse présentée devant l'université Lumière Lyon II.
[9] http://www.tei-c.org/index.xml
[10] Bertrand Coüasnon, Ivan Leplumey, 2003. A Generic Recognition System for Making Archives Documents accessible to Public. In: Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003). IEEE.
[11] Kitab nakhla, G582, Edition av 1862, National library, Rabat, Morocco.